STATISTICAL TREATMENT OF THE DATA OF
BLOOD CONCENTRATION OF A DRUG 


D. VAN DER REYDEN 


Section of Biometry, Onderstepoort Research Institute, 
Union of South Africa* 


1. Summary. Considering the case of a single dosage administered 
enterally to an individual, a mathematical model is derived for the 
blood concentration of a drug in terms of absorption and excretion. A 
transformation of the experimental data is described which allows the 
practical evaluation of this model by the method of least squares. The 
coefficients of the model so estimated are combined to give statistics 
which have a geometrical, statistical, and physiological meaning. It is 
shown that these statistics allow a valid comparison of two or more 
concentration curves, in terms of an efficiency of treatment scale. This 
scale is a segment of the standardized normal distribution. 


2. Introduction. Blood concentration values of any drug administered 
enterally, and measured at suitable time intervals, follow a fairly consis- 
tent pattern. They start at zero, increase rapidly to a maximum, and 
then decrease more slowly to zero asymptotically. The object of this 
paper is to establish the form of this curve mathematically and, from the 
model, to derive statistics measuring relevant characteristics of the 
process. Such statistics will allow valid comparisons to be made 
between the concentration patterns resulting from different treatments. 
The mathematical procedure consists of the following steps: (a) a hypo- 
thetical variable, consisting of those absorption values which would have 
been obtained had no excretion taken place, is introduced and explicitly 
formulated in terms of time, (b) the quantities of the dose actually
excreted are also expressed as a mathematical function of time, and 
(c) the subtraction of actual excretion values from hypothetical absorp- 
tion values determines the mathematical form of the concentration 
curve. The case of a single dosage only is considered here. 


3. Experimental. The data used as illustrations in this article derive 
from a practical investigation by Clark (1951) into the influence of diet 
on sulphanilamide concentration in the blood. Since full experimental 
details will be found in his paper, it is sufficient for the purposes of this 
article to know that equivalent doses of sulphanilamide were adminis- 
tered enterally to individuals of groups on different diets, and that the 
influence of diet on blood concentration was determined by comparison
of the different concentration-time curves. 


4. Formulation. Let K be the single dose of the drug in grams, x that
amount of K which will be absorbed by the blood after lapse of t time
units if no excretion takes place, and z the total amount of the drug
excreted in the urine in t time units. Let V be the blood volume of the
individual in c.c., and y the concentration of the drug in mgm. per 100 c.c.
of the blood at time t.

From these definitions it follows that 


$$y = \frac{(x - z)\,10^{5}}{V} \tag{1}$$


and the mathematical formulation for concentration will now be known 
when x and z have been mathematically expressed in terms of time. 

Consider excretion of the original dose after such a length of time 
that, for all practical purposes, it may be taken as equal to zero. The 
general form which the two variables will take can now be seen to be of 
the nature of monotonically increasing functions. By definition the 
values of x will always be larger than corresponding values of z, except at 
zero and infinite time, when both are respectively equal to zero and K. 
Mathematically these conditions are satisfied by an expression of the 
form 


$$F = \frac{K}{1 + A/f(t)} \tag{2}$$

where f(t) is an as yet unknown function of t and A is a constant.
Re-arranging terms

$$\frac{F}{K - F} = \frac{f(t)}{A} \tag{2a}$$

The left-hand side of this relationship is a monotonic function
increasing from zero to infinity through unity, at which point F is equal
to $\tfrac{1}{2}K$. Since t is always positive, f(t) must therefore also be a
monotonic increasing function. An explicit form for f(t) must now be
determined. This can be done as follows. If any proper fraction of K is
taken, say F = (m/n)K, this value of the ordinate will lie at the
time-point given by the solution of

$$f(t) = \frac{m}{n - m}\,A \tag{2b}$$


An easy solution of this equation is one where f(t) is assumed to be a 
simple polynomial, ¢’ say. Furthermore, if the general form of relation 
(2a) is considered, it is clear that choosing the exponent r equal to unity 
will in a large number of practical cases yield a good approximation,
since this choice takes into account the possible curvature of the left- 
hand side of (2a) when plotted against time. The correctness of this 
choice can of course be tested statistically from the experimental data. 
This is shown in Table I where the observed values of the total amount 
of the drug in gms. excreted by an individual in the urine in t time units
are compared with the values calculated from relation (3a). 


TABLE I.
OBSERVED AND CALCULATED EXCRETION VALUES IN GM.

Time in hrs.   Observed   Calculated   Difference
      3           .04         .04          .00
      6           .14         .14          .00
      9           .55         .40          .15
     12           .63         .51          .12
     27          1.89        1.88          .01
     30          2.12        2.15         -.03
     33          2.32        2.46         -.14
     51          3.53        3.57         -.04
     75          4.41        4.40          .01


With the exponent equal to unity, the formulations for absorption 
and excretion, as defined in (2) become 


$$x = \frac{Kt^{2}}{b + t^{2}} \tag{3}$$

$$z = \frac{Kt^{2}}{a + t^{2}} \tag{3a}$$

where $\sqrt{b}$ and $\sqrt{a}$ are those values of t at which the absorption and
excretion respectively have attained half the value of the original dose,
viz. $\tfrac{1}{2}K$. The value of b will always be considerably smaller than the
value of a. The amount of the drug in the blood at any time t should
now, in terms of the basic definition, be equal to the difference between
x and z. If concentration is expressed in mgm. per 100 c.c. of blood, and
if relative dose is defined as $k = K\cdot 10^{5}/V$, then from (1) the formula for
blood concentration of the drug becomes

$$y = k\left[\frac{t^{2}}{b + t^{2}} - \frac{t^{2}}{a + t^{2}}\right] \tag{4}$$


This subtraction process is shown graphically in Figure 1. In the 
upper graph the differences between the absorption and excretion values 
are shown by the dotted vertical lines. The ordinates of the lower
graph, the concentration curve, are proportional to these lines by an 
amount k(a — 


FIGURE 1. BLOOD CONCENTRATION VALUES PROPORTIONAL TO DIFFERENCE
BETWEEN ABSORPTION AND EXCRETION VALUES. (Vertical axis: concentration;
horizontal axis: time in hours.)


Since the relative dose k and the coefficient a are usually unknowns, 
they can be combined into one unknown by writing the concentration 
curve as 


$$y = \frac{ct^{2}}{(a + t^{2})(b + t^{2})} \tag{4a}$$

so that

$$c = k(a - b) \tag{4b}$$
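As an illustration, the curve can be evaluated numerically. The following Python sketch assumes the squared-time form $y = ct^{2}/((a + t^{2})(b + t^{2}))$ with $c = k(a - b)$; the function names are hypothetical, and the coefficients are borrowed from the first row of Table III purely as sample values.

```python
def concentration(t, a, b, c):
    """Concentration curve (4a): y = c t^2 / ((a + t^2)(b + t^2))."""
    t2 = t * t
    return c * t2 / ((a + t2) * (b + t2))


def concentration_from_parts(t, a, b, k):
    """Same curve written as k times (absorbed - excreted) fraction of the dose."""
    t2 = t * t
    absorbed = t2 / (b + t2)   # x / K, hypothetical absorption without excretion
    excreted = t2 / (a + t2)   # z / K, actual excretion
    return k * (absorbed - excreted)


# Sample coefficients only (rounded values as printed in Table III).
a, b, c = 202.0, 12.0, 1686.0
k = c / (a - b)  # relative dose implied by c = k(a - b)

# Concentration at a few sampling times (hours).
curve = [(t, concentration(t, a, b, c)) for t in (3, 6, 9, 12, 27, 51)]
```

Both functions describe the same curve; the second form simply makes the subtraction of excretion from absorption explicit.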


5. Characteristics of the Blood Concentration Curve. The mathematical 
reproduction of the concentration values thus depends only on the 
magnitudes of the two coefficients a and b, and the size of the relative 
dose k. However, the purpose of statistical method is not always to 
reproduce the data but rather to evaluate it in terms of the object of 
the experiment by selection of appropriate characteristics. Every char- 
acteristic selected will be some function of the coefficients and conversely, 
any function of the coefficients will be some characteristic of the curve.
Which are now the important characteristics? 

Although the answer depends to a certain extent on the object of 
the experiment, very few of these objects will be achieved without using 
in some way or other the technique of comparison of curves. In other 
words, those characteristics which will allow valid comparisons to be 
made between two or more concentration curves must be selected. 
Generally speaking, the statistics of comparison requires the use of at 
least three basic measures, a point value (e.g. the mode or mean), a line 
value (e.g. the probable error or standard deviation), and a frequency 
value (e.g. number of observations). Furthermore, the point value and 
the line value should in general be statistically independent. Since the 
general form of the concentration curve resembles that of a frequency 
distribution, the problem is whether analogous characteristics can be 
found for the concentration curve and, if so, what biological interpreta- 
tions can be attached to such measures. Conversely, if measures be 
selected on biological grounds, will these measures satisfy the require- 
ments for comparison? 

For instance, in establishing the efficiency of different drugs or treat- 
ments from a biological point of view, the concentration curve of the 
most efficient drug or treatment should conform to the following require- 
ments: (a) the time interval between zero and maximum concentration 
should be the least, (b) the time interval during which the concentration
is not less than some required value should be the largest, and (c) the 
maximum concentration should be such that the relative doses should be 
equal. 

These measures will now be set up mathematically and then tested
for validity in terms of comparison requirements. Maximum concentration
occurs when the difference between absorption and excretion is a
maximum. Thus at the point where (x - z) is a maximum,
dx/dt = dz/dt, from which equation we obtain the time point of maximum
concentration as

$$t_m = (ab)^{1/4} \tag{5}$$

Substitution gives the value of maximum concentration as

$$y_m = k\,\frac{\sqrt{a} - \sqrt{b}}{\sqrt{a} + \sqrt{b}} \tag{6}$$

or, defining

$$I = \sqrt{a} - \sqrt{b}, \qquad T = \sqrt{a} + \sqrt{b}$$

maximum concentration can be expressed as

$$y_m = kI/T \tag{6a}$$


Interpreted in terms of the absorption and excretion curves, I can be
called the half-dosage interval, i.e. the interval during which absorption
is larger than K/2 and excretion is less than K/2. Analogously T can be
termed the half-concentration interval, since T is the difference between
the abscissae of the points of intersection of the line $y = \tfrac{1}{2}y_m$ with the
concentration curve. Thus T is the period of time in which the
concentration is not less than half the maximum concentration.

The three characteristics, $t_m$, T, and $y_m$, have a geometrical meaning
and can be estimated directly from the graph of the blood concentration
curve as illustrated in figure 2, where the concentration curves with their
characteristics of two individuals on different diets are contrasted.
Furthermore these statistics satisfy the requirements of comparison
since they are respectively the analogues of the mode, the probable error,
and the maximum of a frequency distribution.

FIGURE 2. BASIC CHARACTERISTICS OF CONCENTRATION CURVE
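The three characteristics can be computed directly from the coefficients. The sketch below assumes $t_m = (ab)^{1/4}$, $I = \sqrt{a} - \sqrt{b}$, $T = \sqrt{a} + \sqrt{b}$ and $y_m = kI/T$, and checks numerically that the two half-maximum crossing times lie T apart. Function names and sample values are illustrative only.

```python
import math


def concentration(t, a, b, k):
    """Concentration k(a - b)t^2 / ((a + t^2)(b + t^2)) at time t."""
    t2 = t * t
    return k * (a - b) * t2 / ((a + t2) * (b + t2))


def characteristics(a, b, k):
    """Return (t_m, I, T, y_m) for coefficients a > b > 0 and relative dose k."""
    root_a, root_b = math.sqrt(a), math.sqrt(b)
    t_m = (a * b) ** 0.25      # time of maximum concentration
    I = root_a - root_b        # half-dosage interval
    T = root_a + root_b        # half-concentration interval
    y_m = k * I / T            # maximum concentration
    return t_m, I, T, y_m


def half_maximum_times(a, b):
    """Times at which the curve equals half its maximum.

    With u = t^2 the condition y = y_m / 2 reduces to
    u^2 - (a + b + 4 sqrt(ab)) u + ab = 0.
    """
    s = a + b + 4.0 * math.sqrt(a * b)
    disc = math.sqrt(s * s - 4.0 * a * b)
    return math.sqrt((s - disc) / 2.0), math.sqrt((s + disc) / 2.0)


a, b, k = 202.0, 12.0, 8.84   # illustrative values only
t_m, I, T, y_m = characteristics(a, b, k)
t_lo, t_hi = half_maximum_times(a, b)
```

The difference `t_hi - t_lo` reproduces T, which is the sense in which T measures the time spent at or above half the maximum concentration.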
6. Interrelationship of Characteristics: a measure of drug efficiency. It
will now be shown that a measure of the efficiency of a drug administered
enterally can be obtained from the interrelationship of the characteristics
of the concentration curve. Subtracting the square of the relative dose,
viz. $k^{2}$, from both sides of (6a) squared, eliminating $I^{2}$ by the relation
$t_m^{2} = \tfrac{1}{4}(T^{2} - I^{2})$ and putting $T_m = 2t_m$ we obtain

$$\left(\frac{y_m}{k}\right)^{2} = 1 - \left(\frac{T_m}{T}\right)^{2} \tag{6b}$$

Since $T_m/T$ is the ratio of the geometric mean to the arithmetic mean
of the coefficients $\sqrt{a}$, $\sqrt{b}$, $T_m$ will always be smaller than T. Therefore
$T_m/T$ will always be smaller than 1. Taking logarithms of both sides
of (6b), putting $y' = y_m/k\sqrt{2\pi}$, we obtain as first approximation

$$y' = \frac{1}{\sqrt{2\pi}}\exp\left[-\tfrac{1}{2}\left(\frac{T_m}{T}\right)^{2}\right]$$

an expression relating the three characteristics.

This expression is the well-known standardized normal distribution.
Given the ratio $T_m/T$ in any specific case the corresponding normalized
maximum concentration value can be read off from standard tables. It
should be observed that since $T_m/T < 1$ the above relationship is that
segment of the normal distribution lying between the origin and one
standard deviation. Any concentration curve measured by these three
characteristics will now appear as a point on the graph in figure 3. It
follows that any comparison between groups of concentration curves is
equivalent to a comparison of groups of points on the normal curve.
The closer any point lies to the origin, the more efficient that specific
treatment. This will happen when the ratio $T_m/T$ is small, that is,
when $T_m$ is small and T large, which is our requirement for efficiency.
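To see how the efficiency ratio behaves, the following sketch computes $T_m/T = 2(ab)^{1/4}/(\sqrt{a} + \sqrt{b})$ and compares the exact relative maximum $\sqrt{1 - u^{2}}$ with the normal-curve form $\exp(-u^{2}/2)$, where $u = T_m/T$. The approximation rests on $\log(1 - u^{2}) \approx -u^{2}$ and is therefore closest for small ratios. Names and sample values are assumptions made for illustration.

```python
import math


def efficiency_ratio(a, b):
    """T_m / T = 2 (ab)^(1/4) / (sqrt(a) + sqrt(b)); smaller means more efficient."""
    return 2.0 * (a * b) ** 0.25 / (math.sqrt(a) + math.sqrt(b))


def exact_relative_maximum(u):
    """y_m / k implied by (y_m / k)^2 = 1 - u^2."""
    return math.sqrt(1.0 - u * u)


def normal_approximation(u):
    """First approximation exp(-u^2 / 2), obtained from log(1 - u^2) ~ -u^2."""
    return math.exp(-0.5 * u * u)


a, b = 202.0, 12.0   # illustrative coefficients only
u = efficiency_ratio(a, b)
exact = exact_relative_maximum(u)
approx = normal_approximation(u)
```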


7. Estimation by least squares. The three directly measurable
characteristics of the concentration curve were derived above in terms of the
coefficients a and b. By eliminating $\sqrt{a}$ and $\sqrt{b}$ from the expressions
for $t_m$ and T it is easily seen that the numerical values of $\sqrt{a}$ and $\sqrt{b}$
are given by the solutions of the quadratic equation

$$w^{2} - Tw + t_m^{2} = 0 \tag{7}$$

Also, by appropriate combination of (4b) and (6a) an estimate of c can
be obtained from the relation

$$c = y_m T^{2} \tag{8}$$

FIGURE 3. TREATMENT EFFICIENCY SCALE


Estimates of the coefficients a, b, and c can thus be obtained by graphical
means, as can be seen from figure 2. This is, however, a highly subjective
procedure and can only be looked upon as a first approximation. An
objective method is provided, for instance, by the method of least
squares.
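The graphical route can be sketched as follows. Since $\sqrt{a} + \sqrt{b} = T$ and $\sqrt{a}\,\sqrt{b} = t_m^{2}$, the two square roots are the roots of $w^{2} - Tw + t_m^{2} = 0$, and combining $c = k(a - b)$ with $y_m = kI/T$ gives $c = y_m T^{2}$. The function name and the round-trip values are hypothetical.

```python
import math


def coefficients_from_graph(t_m, T, y_m):
    """Recover (a, b, c) from readings of t_m, T and y_m taken off the graph."""
    disc = T * T - 4.0 * t_m * t_m
    if disc < 0:
        raise ValueError("readings inconsistent: T must be at least 2 t_m")
    root = math.sqrt(disc)
    sqrt_a = (T + root) / 2.0   # larger root of w^2 - T w + t_m^2 = 0
    sqrt_b = (T - root) / 2.0   # smaller root
    c = y_m * T * T
    return sqrt_a ** 2, sqrt_b ** 2, c


# Round trip with illustrative coefficients: compute the readings, then invert.
a0, b0, k0 = 202.0, 12.0, 8.84
t_m0 = (a0 * b0) ** 0.25
T0 = math.sqrt(a0) + math.sqrt(b0)
y_m0 = k0 * (math.sqrt(a0) - math.sqrt(b0)) / T0
a1, b1, c1 = coefficients_from_graph(t_m0, T0, y_m0)
```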

If the formula found for blood concentration of the drug is
appropriate, its relation to the experimental data must of necessity be of the type

$$y = \hat{y} + \varepsilon \tag{9}$$

where

$$\hat{y} = \frac{ct^{2}}{(a + t^{2})(b + t^{2})} \tag{4a}$$

and $\varepsilon$ is a random series, characterized by zero mean and minimum
variance. The estimation of the coefficients of (4a) by the method of
least squares entails the minimization of $\sum\varepsilon^{2}$, which process should lead
to three solvable simultaneous equations in the coefficients, or known
combinations of the coefficients. Since the simultaneous equations
obtained by minimizing the above formulation cannot be "linearized",
the direct application of the method of least squares is impossible. A
transformation of (4a) must therefore be found that will allow the
solution of the simultaneous equations resulting from the minimization of
the transformed variable.


TABLE II.
COMPARISON OF SYSTEMATIC AND RANDOM RESIDUALS

                     Transformed Values              Re-transformed Values
Time in hrs.   (yt)^-1     (ŷt)^-1       e_r           y        ŷ        ε
      3        .075,758    .075,705     .000,053      4.4     4.40     0.00
      6        .025,192    .025,770    -.000,578      6.6     6.47     0.13
      7        .021,593    .021,901    -.000,308      6.6     6.52     0.08
      8        .019,531    .019,452     .000,079      6.4     6.43    -0.03
      9        .018,215    .017,825     .000,390      6.1     6.23    -0.13
     12        .017,361    .015,431     .001,930      4.8     5.40    -0.60
     27        .011,223    .017,064    -.005,841      3.3     2.17     1.13
     33        .023,310    .019,193     .003,397      1.3     1.58    -0.28
     51        .024,510    .026,663    -.002,153      0.8     0.74     0.06
  Total        .236,693    .239,093    -.003,031     40.3    39.94     0.36
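The transformed column of Table II can be reproduced from the observed concentrations. This sketch assumes the transformation $t^{r}/y$ with $r = -1$, i.e. $1/(yt)$; the helper name is hypothetical. Small discrepancies from the printed figures may arise where the tabulated y values are rounded.

```python
# Observed concentrations from Table II: (time in hours, y in mgm. per 100 c.c.).
observations = [(3, 4.4), (6, 6.6), (7, 6.6), (8, 6.4), (9, 6.1),
                (12, 4.8), (27, 3.3), (33, 1.3), (51, 0.8)]


def transform(t, y, r=-1):
    """Transformed value t^r / y; with r = -1 this is 1 / (y t)."""
    return t ** r / y


transformed = [(t, transform(t, y)) for t, y in observations]
```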


In the relation (9) y is expressed as the sum of a systematic and a
random part. Since any transformation of y will have the effect of
combining $\varepsilon$ with functions of t, the transformed value of y may be
regarded as the sum of two systematic parts. Applying the least squares
method to the transformed values means minimizing that systematic
part which contains $\varepsilon$. This procedure will only be valid if, after
re-transformation, the data can once again be expressed as the sum of a
systematic and random part and, furthermore, if both the systematic and
random parts of the re-transformed data tend to the corresponding parts
of the pre-transformed data. A transformation fulfilling these
requirements will now be described.

It will be observed that the reciprocal of $\hat{y}$ multiplied by $t^{r}$ is an
explicit function of t, since putting A = ab/c, B = (a + b)/c and C = 1/c

$$\frac{t^{r}}{\hat{y}_r} = At^{r-2} + Bt^{r} + Ct^{r+2} \tag{10}$$

where $\hat{y}_r$ will have differing values depending on r. Because in this case
the method of least squares is directly applicable, the analogue of (9) for
this transformation becomes

$$\frac{t^{r}}{y} = \frac{t^{r}}{\hat{y}_r} + e_r \tag{11}$$

where $e_r$ is now a systematic combination of $\varepsilon$ and functions of t.
Applying the same transformation to the original values of formulation (9)

$$\frac{t^{r}}{y} = \frac{t^{r}}{\hat{y} + \varepsilon} \tag{12}$$

where $\hat{y}$ has a value independent of r. Subtraction of (12) from (11)
yields

$$t^{r}\left(\frac{1}{\hat{y}} - \frac{1}{\hat{y}_r}\right) = e_r + \frac{\varepsilon t^{r}}{\hat{y}(\hat{y} + \varepsilon)} \tag{13}$$

The question can now be asked: "For what value of r will the
left-hand side of (13) tend to zero?" This is equivalent to asking what
transformation will ensure the equality of the re-transformed data with
the pre-transformed data. The answer will, of course, depend on the
nature of the practical data supplied, but if t is measured in positive
integers, convergency criteria will show the necessity of r being a
negative exponent. Practical requirements will also set a limit to the size of
this negative exponent. Every curve fitting problem can only be worked
within a usually prescribed interval of accuracy, e.g., accurate to 6
decimals. The exponent should thus have a value such that the
magnitudes of the terms in the normal equations will fall within the required
limits of accuracy. This is very important since reciprocals of powers of
t enter into the quantities of the normal equations. In practice,
therefore, r will seldom exceed the value of -2, but the determination of the
best size of r remains essentially a practical problem. What is
important, however, is that whenever a transformation is applied to practical
data to facilitate curve fitting, the particular transformation from
which the normal equations employed were derived should be stated
explicitly, so as to ensure uniformity of estimation.

If then a suitable numerical value for the exponent r has been found,
the left-hand side of (13) tends to zero, and it follows that $\hat{y}_r$ is
approximately equal to $\hat{y}$. If $e_r$ is now expressed in terms of $\varepsilon$ and substituted
in (11) and this result re-transformed we again arrive at a very near
approximation of the basic relation of (9), thus completing the
demonstration of the argument. In this particular experiment practical trial
showed that r = -1 gave a satisfactory fit for the range of data
observed. An example is given in Table II to show the relative magnitudes
of the two residuals. The smaller the mean and variance of $\varepsilon$, the better
the fit of the re-transformed values.

Suppose now, in general, that investigation has disclosed what
transformation should be applied. Find the normal equations by
minimizing the appropriate $e_r$ and solve them, by one of the many available
methods, for A, B, and C. It is easily seen that the coefficients a and b
are the solutions of the quadratic equation

$$C\alpha^{2} - B\alpha + A = 0 \tag{14}$$


The various characteristics of the concentration curve may now be 
calculated by substitution of the values just found. 
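The whole estimation procedure can be sketched in a few lines. The code below is a minimal illustration, not the author's computing scheme: it assumes the transformed model $t^{r}/y = At^{r-2} + Bt^{r} + Ct^{r+2}$ with A = ab/c, B = (a + b)/c, C = 1/c, forms the three normal equations, solves them by Gaussian elimination, and recovers a and b as the roots of the quadratic in (14). All function names are hypothetical, and the check uses synthetic, error-free data.

```python
import math


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_curve(times, ys, r=-1):
    """Least squares fit of the transformed model; returns (a, b, c)."""
    rows = [[t ** (r - 2), t ** r, t ** (r + 2)] for t in times]
    targets = [t ** r / y for t, y in zip(times, ys)]
    normal = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
              for i in range(3)]
    right = [sum(row[i] * u for row, u in zip(rows, targets)) for i in range(3)]
    A, B, C = solve3(normal, right)
    # a and b are the roots of C*alpha^2 - B*alpha + A = 0; real data may give
    # a negative discriminant, in which case math.sqrt raises ValueError.
    disc = math.sqrt(B * B - 4.0 * A * C)
    return (B + disc) / (2.0 * C), (B - disc) / (2.0 * C), 1.0 / C


# Synthetic check: generate exact values from known coefficients and refit.
true_a, true_b, true_c = 200.0, 12.0, 1700.0
times = [3, 6, 7, 8, 9, 12, 27, 33, 51]
ys = [true_c * t * t / ((true_a + t * t) * (true_b + t * t)) for t in times]
a_hat, b_hat, c_hat = fit_curve(times, ys)
```

With observed rather than synthetic concentrations the same call applies; the choice of r then affects how the residuals are weighted, which is why the transformation used should be stated.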


8. Comparison of Blood Concentration Data. So far we have been dealing 
with a single concentration curve, setting up a general model, and 
deriving certain measurable aspects from this model. What happens 
when we have a collection of concentration curves, derived from a 
group or groups of individuals? What conditions must be satisfied in 
order that valid comparisons can be made between the characteristics 
of individuals or groups of individuals? 

The numerical values of the three characteristics derived depend on
different combinations of the two coefficients $\sqrt{a}$ and $\sqrt{b}$ only, where
$\sqrt{a}$ and $\sqrt{b}$ are the time points where half of the original dose is either
excreted or absorbed. The magnitudes of $\sqrt{a}$ and $\sqrt{b}$ are not,
however, necessarily dependent on the size of the dose only. The basic
processes of absorption and excretion could possibly be accelerated or 
retarded by experimental conditions, using either the same or different 
doses. This possibility leads to two different approaches in dosage 
experiments. One, to compare the possible effect of experimental condi- 
tions on the concentration curve using equivalent dosage, the other to 
compare the effect of different doses under the same experimental condi- 
tions. These two approaches should be clearly distinguished. 

If the object is to compare experimental conditions using the same 
dosage level, it is necessary to derive a criterion for establishing the 
equivalence of doses given to different individuals. Since concentration 
is defined in terms of blood volume it is reasonable to assume, ceteris
paribus, that the intensity of the whole absorption-excretion process is
dependent on the size of the total blood volume. On this assumption
equivalent doses will be obtained whenever the relative doses given to
individuals are equal. It will be recalled that relative dose was defined
as $k = K\cdot 10^{5}/V$.

Total blood volume, V, is usually unknown though in some cases 
estimates can be obtained experimentally or theoretically. In practice 
usually, the size of the dose is determined by the total body weight of 
the individual, the underlying assumption being the existence of a direct 
relationship between body weight and blood volume of individuals.
The correctness of this assumption can be tested statistically by
calculating k from relation (4b) for each individual, and determining the
homogeneity of the set of k values so calculated by the usual statistical 
tests. 


TABLE III.
HOMOGENEITY OF RELATIVE DOSE

Treatment       a      b        c         k     (k - k̄)/s
              202     12     1,686      8.84       0.9
    A         159     11     1,606     10.82       2.1
              216     10     2,080     10.10       1.7

    B        1021      7     6,384      6.30      -0.6
              826      7     5,501      6.72      -0.3

    C        2428      4    14,707      6.05      -0.7
             2013      5    11,856      5.89      -0.8

              621      7     5,117      8.24       0.6
    D         617      4     5,133      8.32       0.6
              749      6     5,956      8.02       0.4
              843      7     6,698      8.02       0.4
             2466           14,150      5.74      -0.9
    E         771      3     5,868      7.63       0.2
              992      3     6,884      6.96      -0.2

             4352      3    22,387      5.15      -1.3
    F        3955      5    20,221      5.11      -1.3
             1796      4    10,605      5.92      -0.8

Average (k̄)                             7.28
Std. Deviation (s)                       1.67
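The summary rows of Table III can be reproduced from the k column. The sketch below uses Python's standard `statistics` module; the sample standard deviation (divisor n - 1) is assumed, since that divisor matches the printed value. The helper `relative_dose` is a hypothetical name for the relation k = c/(a - b).

```python
import statistics

# Relative doses k as printed in Table III, in row order.
k_values = [8.84, 10.82, 10.10, 6.30, 6.72, 6.05, 5.89, 8.24, 8.32,
            8.02, 8.02, 5.74, 7.63, 6.96, 5.15, 5.11, 5.92]

k_bar = statistics.mean(k_values)
s = statistics.stdev(k_values)          # sample standard deviation
deviates = [(k - k_bar) / s for k in k_values]


def relative_dose(a, b, c):
    """Relative dose from the fitted coefficients, using c = k(a - b)."""
    return c / (a - b)
```

Each entry of `deviates` corresponds to the last column of the table and is the quantity compared with the critical ratio.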


In this particular experiment Table III shows that equivalence of dose
can be assumed, if 1.96 is used as critical ratio on the 5% significance
level. This is then the necessary condition to be satisfied before any
comparisons are attempted. Having this, the ratio $T_m/T$ can be
calculated for each individual and the results plotted on the basis of equation
(7c). The usual analysis of variance techniques can now be applied to
the group means to establish significance of differences, and thus
determine the most efficient treatment. In figure 3, for instance, the
lettered vertical lines represent the averages of the ratios $T_m/T$ for the
different treatments A to F, and show treatment F to be the most
efficient of the set.


9. Conclusion. (1) By introducing hypothetical absorption values from 
which actual excretion values are subtracted a “concentration law” can 
be derived, describing the change in blood concentration of a drug as a 
function of time. This law, applicable to a wide range of data, is completely
determined by only two coefficients. (2) Appropriate combination
of these two coefficients leads to statistics, analogous to those of a
frequency distribution, by means of which two or more concentration 
curves can be compared. (3) An efficiency scale, equal to a segment of 
the standardized normal distribution, is derived from the interrelation- 
ship of these statistics. This scale can be used to evaluate the efficiency 
of different treatments. 


This paper is published with the permission of the Director of Veterinary 
Services, Union of South Africa. Grateful acknowledgement is made to 
Professor Richard Clark, Head of the Physiology Section of this institute 
for posing the problem and supplying the data, my colleague, G. 
Abraham, Esq., for various suggestions incorporated in this paper, and 
the Art Section for preparing the diagrams. 
EVALUATION OF A CLASS OF DIAGNOSTIC TESTS

Diagnostic tests, such as stool examination for specific
infections, or the NIH swab for oxyuriasis, for which a positive 
diagnosis is based on definite identification of the organism, or its cysts 
or trophozoites in the stool, or of the pinworm egg in the swab, constitute 
a class of diagnostic tests which, under certain control conditions, yield 
no false positives. 

How good any one such diagnostic test is, is measured by the proba- 
bility that an infected person will be found positive by a single applica- 
tion of the test. If we assume that this probability is the same for all 
infected individuals, we may term this probability the efficiency of the 
diagnostic test. This assumption implies that we are considering only 
infections diagnosable by the test used, rather than cases for which the 
locus or form of infection is such that the test cannot diagnose it. 

Given a group of known infected persons, the estimation of efficiency 
is simple; it is given by the ratio of number of positive results obtained 
to the total number of examinations performed on the known infected 
persons. In general, however, we do not have such a group of known 
infected persons, but have instead a group of unknown persons on each 
of whom one or more examinations are made. Also, in addition to esti- 
mating the efficiency of the test, we may be required to estimate the 
prevalence of the infection in the population for which our group is 
considered to be a representative sample. These problems, estimation
of efficiency and prevalence, will be considered here. 


Equal Numbers of Examinations 


Where each individual in the group receives the same number of 
examinations, [spaced far enough apart in time so as to be independent 
though not so far apart as to incur any appreciable risk of developing an 
infection] estimators of the efficiency, E, and of the prevalence rate, P,
are readily provided by the method of maximum likelihood. These 
estimators have been determined by Sawitz and Karpinos [1] and, in 
the notation of the present paper, are given by:¹

¹Uncritical use of these formulas may occasionally give estimates of prevalence greater than 100%.
When this occurs the correct maximum likelihood estimates are $\hat{P} = 1$, $\hat{E} = R/nN$. Similar
considerations apply when the proposed estimators below result in anomalous estimates. The estimates to be
used are then the same as for the maximum likelihood case.

$$\hat{E} = \frac{R\,[1 - (1 - \hat{E})^{n}]}{nK} \tag{1}$$

$$\hat{P} = \frac{R}{nN\hat{E}} = \frac{K}{N\,[1 - (1 - \hat{E})^{n}]} \tag{2}$$


where R is the total number of positive test results, n is the number of
examinations given each individual, K is the number of individuals with 
any positive test results, and N is the number of individuals in the group. 
These estimates can readily be seen to be: (1), the adjusted ratio of 
number of positive examinations to number of examinations on detected 
infected individuals, and (2), the adjusted ratio of detected infected 
individuals to number of individuals examined. In each case the adjust- 
ment corrects for the probable number of infected persons not detected. 
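Because the efficiency estimate appears on both sides of its defining equation, it has to be found numerically. The sketch below solves $\hat{E} = R[1 - (1 - \hat{E})^{n}]/nK$ by bisection and then computes $\hat{P} = R/nN\hat{E}$, falling back to $\hat{P} = 1$, $\hat{E} = R/nN$ when the prevalence estimate would exceed 100%. The function name, the bisection method and the example counts are assumptions made for illustration; they are not taken from Sawitz and Karpinos.

```python
def ml_estimates(R, K, N, n, tol=1e-12):
    """Return (efficiency, prevalence) for equal numbers of examinations.

    R: total positive results, K: persons with any positive result,
    N: persons examined, n: examinations per person.
    """
    if n < 2 or R <= K:
        raise ValueError("need n >= 2 and R > K for a usable solution")

    def g(e):
        # Zero of g is the estimate; g(0) = 0 is the trivial root to avoid.
        return R * (1.0 - (1.0 - e) ** n) / (n * K) - e

    lo, hi = 1e-9, 1.0
    if g(hi) >= 0.0:
        efficiency = 1.0          # every examination of detected persons positive
    else:
        while hi - lo > tol:      # g(lo) > 0 > g(hi) throughout
            mid = 0.5 * (lo + hi)
            if g(mid) > 0.0:
                lo = mid
            else:
                hi = mid
        efficiency = 0.5 * (lo + hi)

    prevalence = R / (n * N * efficiency)
    if prevalence > 1.0:          # anomalous case described in the footnote
        prevalence, efficiency = 1.0, R / (n * N)
    return efficiency, prevalence


# Made-up counts: 200 persons, 2 examinations each, 75 detected, 100 positives.
efficiency, prevalence = ml_estimates(R=100, K=75, N=200, n=2)
```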

The maximum likelihood solution is readily applicable also to the 
case in which each individual receives the same number of examinations, 
but a battery of technics is applied on each examination. The formulas 
shown for the preceding case will provide estimates of prevalence and of 
the efficiency of the battery as a whole. The efficiency of any one technic
is estimated by 


$$\hat{E}_t = \frac{R_t\,[1 - (1 - \hat{E}_b)^{n}]}{nK} \tag{3}$$

where $R_t$ is the total number of positive test results by the technic and
$\hat{E}_b$ is the estimated efficiency of the battery.²


Unequal Numbers of Examinations 


Efficiency 


When the number of examinations given an individual is not con- 
stant, but varies from person to person, the maximum likelihood equa- 
tions become too cumbrous for general usefulness. Instead, the esti- 
mator of efficiency proposed here is given by the ratio 


(4) E = A/B 


where A is the total number of positive examination results by the
technic obtained on persons known to be infected independently of this
examination, and B is the total number of examinations by the technic
made under conditions for which the individual is known to be infected
independently of the examination. An individual is known to be infected
independently of an examination if there is prior knowledge of
infection, or if at least one positive result has been obtained by any technic
on any other examination. If the only knowledge of infection comes
from positive results by one or more technics on a single stool there is no
independent knowledge of infection for any technic for this examination.
For all other examinations this independent knowledge exists for all
technics.

²This distinction between $E_t$ and $E_b$ was overlooked by Sawitz and Faust [2] in a study involving
a battery of technics applied to stool examinations. This led to underestimation of the efficiency of the
technics and combinations of technics reported on. In another study by Faust, et al. [3] a battery of
ten technics of stool examinations was applied to single stools from each individual. The estimates
reported in Faust's paper [3] should properly be considered to be relative efficiencies of the individual
technics or combinations of technics considered, with the efficiency of the complete battery representing
100%.

It follows readily from this definition that the ratio of the expected 
value of A to the expected value of B is the true technic efficiency. For, 
with respect to any examination by a technic on an infected indi- 
vidual the expected contribution to B is the probability that the person 
will be independently known to be infected, and the contribution to A 
will be this probability times the probability of getting a positive result 
by the technic on this examination. But this latter probability is the 
technic efficiency. The results of examinations on noninfected persons 
make no contribution to either A or B. 

Where only a single technic is being used, the suggested estimate of
efficiency, A/B, becomes (R - S)/(T - S), where R is the total number
of positive test results, S is the number of persons with only one
examination positive by the technic, and T is the total number of examinations
by the technic performed on individuals found to be positive. If each
individual receives the same number of examinations, n, T = nK.
Where several technics are applied at each examination the estimate of
efficiency for any one technic becomes (R - S')/(T - S' - D) where
R is the total number of positive results obtained using the technic, S' is
the number of persons with positive results for only one examination,
and positive on that examination by the technic, and D is the number of
persons with positive results for only one examination, but negative on
that examination by the technic. Where some of the persons examined
were known to be infected prior to examination, E may
be computed by the formulas above for the remaining individuals, but
with the numerator and denominator increased by the number of positive
results and the number of examinations, respectively, by the technic on
the group of known infections. In all cases, the standard error of
estimate of E can be taken approximately as though it were a proportion
based on a sample of size B, the denominator of the estimate; that is

$$\mathrm{SE}(\hat{E}) \approx \sqrt{\hat{E}(1 - \hat{E})/B} \tag{5}$$
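For a single technic, the estimate can be computed either straight from the definition of A and B or from the shortcut (R - S)/(T - S). The following sketch does both on a small made-up record set and also evaluates the approximate standard error; the data layout (one list of 0/1 results per person) and the function names are assumptions.

```python
import math


def efficiency_by_definition(records):
    """Return (A, B) computed directly from the definition, single technic.

    An examination counts toward B only if the same person has a positive
    result on some other examination; A counts the positives among those.
    """
    A = B = 0
    for results in records:
        positives = sum(results)
        for result in results:
            if positives - result >= 1:   # positive on another examination
                B += 1
                A += result
    return A, B


def efficiency_shortcut(records):
    """(R - S) / (T - S) with R, S and T as defined in the text."""
    R = sum(sum(results) for results in records)
    S = sum(1 for results in records if sum(results) == 1)
    T = sum(len(results) for results in records if sum(results) >= 1)
    return (R - S) / (T - S)


def standard_error(E, B):
    """Approximate standard error, treating E as a proportion on B trials."""
    return math.sqrt(E * (1.0 - E) / B)


# Made-up results for four persons with unequal numbers of examinations.
records = [[1, 1, 0], [1, 0], [0, 0, 0], [0, 1, 1, 1]]
A, B = efficiency_by_definition(records)
E_hat = efficiency_shortcut(records)
se = standard_error(E_hat, B)
```

Here the person with a single positive result contributes only the negative examination to B, which is what subtracting S from both numerator and denominator accomplishes.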


Unequal Numbers of Examinations—Prevalence 


Maximum likelihood estimation of the prevalence rate is also difficult 
when the number of examinations varies from individual to individual. 
A simplifying assumption makes the prevalence rate amenable to estimation.
This assumption is that, though the number of examinations
made varies from individual to individual, the variation in number of 
examinations made is random and is not associated with whether or not 
the individual is infected. Under this assumption the estimate of preva- 
lence provided by the results of a technic is 


(6) P = R/GE, 


where G is the grand total of examinations made using the technic. This 
estimate can be seen to be the same as the maximum likelihood one apply- 
ing in the Sawitz-Karpinos case where, since all individuals receive the 
same number of examinations, the condition of non-association is met. 

The error of estimate of P is made up of three variable factors: the 
variation in the number of infected persons in the sample examined, the 
variation in the proportion of positive stool examinations made in exam- 
ining infected persons, and the variation of the estimate of efficiency 
used in estimating P. If E is known exactly, the standard error of P is 
given by 


(7) $\mathrm{SE}(P) \approx P \sqrt{\frac{1 - P}{NP} + \frac{1 - E}{E \bar{n} N P}}$, 


where $\bar{n}$ is the average number of examinations performed on an 
individual and the condition of non-association is met. If an independent 
estimate of E, with known standard error, is used in estimating P, the 
standard error of P is given by 


(8) $\mathrm{SE}(P) \approx P \sqrt{\frac{1 - P}{NP} + \frac{1 - E}{E \bar{n} N P} + \left[\frac{\mathrm{SE}(E)}{E}\right]^{2}}$. 

However, for the situation with which we are concerned, we neither 
know E nor have an independent estimate of it. Instead we have an 
estimate of E which is correlated with R, the number of positive test 
results by the technic. When all individuals receive the same number of 
examinations by a single technic the covariance of R and E is 
approximately E(1 - E) and the standard error of the prevalence rate, using 
the sample efficiency, is given by 


(9) $\mathrm{SE}(P) \approx P \sqrt{\frac{1 - P}{NP} - \frac{1 - E}{E n N P} + \frac{1 - E}{E B}}$. 


In some situations, while the number of examinations given an 
individual may be associated with whether or not he is infected, it is possible 
to select the specific examinations which are not associated with 
infection. This occurs, for example, when a sample of individuals each 
receives a single examination and, in order to determine efficiency more 
precisely, the preponderance of subsequent examinations are given to 
individuals for whom the first examination was positive. Since the first 
examination was given independently of infection we can estimate the 
prevalence from the results of the first examination, $P = R_1/NE$, where 
$R_1$ is the number of positive first examinations, and E is the examination 
efficiency (See footnote 3) estimated from all examinations. For this 
situation the covariance of $R_1$ and E can be considered small enough to 
be disregarded and the standard error of P becomes 


(10) $\mathrm{SE}(P) \approx P \sqrt{\frac{1 - P}{NP} + \frac{1 - E}{E N P} + \frac{1 - E}{E B}}$. 


An alternative estimate of the prevalence rate provided by the results 
of a technic is given by 


(11) $P = \frac{1}{NE} \sum_{i} \frac{r_i}{n_i}$, 


where $r_i$ and $n_i$ are respectively the number of positive results and 
number of examinations by the technic for the i'th individual. This estimate 
is applicable whether or not the number of examinations given an indi- 
vidual is associated with infection, and so is useful when one cannot 
make, or is unwilling to make, the assumption of non-association.* The 
greater computational labor required for this estimate limits its useful- 
ness. 


Limitations of the Procedures 


The procedures indicated above provide estimates of efficiency and 
prevalence under the assumption that examination efficiency is the same 
for all infected individuals. For some diseases this assumption may not 
be realistic, and instead examination efficiency may vary from infected 
individual to infected individual, depending upon the nature and degree 
of infection. Use of the procedures above results in a weighted average 
efficiency, the weights tending to be proportional to the probability of 
getting at least one positive result for an individual. This weighting 
results in an estimate of efficiency which on the average, will be greater 


*This alternative estimate of the prevalence rate does not apply when the association between 
number of examinations and infection arises from a sampling plan under which the decision to make 
further examinations on an individual depends on the results of his previous examinations. Under 
such sampling plans the estimates of efficiency and prevalence in this paper are not applicable either, 


though estimates may be formulated to fit such sampling plans. For the plan considered in the fore- 
going paragraph the procedures of this paper for estimating efficiency would be applicable if the results 
of the first examination were treated as providing supplemental information that a person is infected, 
not for purposes of estimating efficiency directly. 




than the average efficiency among infected persons, and in too small an 
estimate of prevalence. With increasing numbers of examinations given 
an individual the disparity in weights becomes small and the estimated 
rates approach the population average efficiency and population preva- 
lence. This characteristic arises from the use of estimates which depend 
on the self-consistency of examination results, and applies both to the 
maximum likelihood estimates and to the proposed estimates. 

A partial answer to the problem of varying efficiency is given in a 
paper by Neyman [4]. The procedure indicated is to postulate the exis- 
tence of a limited number of degrees of infection, each having its own 
efficiency and prevalence rate and to make estimates of these. In his 
example Neyman postulates two degrees of infection, other than unin- 
fected, the greater degree of infection being so severe as to have 100% 
efficiency. The estimate of efficiency for the lesser grade of infection 
continues however to be a biased average efficiency. Even so, for the 
purpose, this estimate is useful as a measure of efficiency for diagnosing 
moderate infections before irreparable damage by the disease, in Ney- 
man’s case tuberculosis, has been done. There would be little gain in 
making this distinction for infections for which 100% efficiency is not 
consonant with irreparable damage. 


An Application 


Examples of the results of use of the procedures given here may be 
found in Tobie et al. [5]. In this paper are presented the results of 
examination for various intestinal protozoa of a total of 684 stools taken 
from 243 persons. The number of stools examined for a person ranged 
from one to seven, and all stools were examined by the zinc sulfate tech- 
nic. In general, all stools, except the first taken from a person, were also 
examined by the direct smear technic. Application of the methods 
given here results in the following estimates of efficiency for the detection 
of amebiasis: 


Direct smear technic—27%, with R = 97, S’ = 7, T = 344, D = 5; 


Zinc sulfate technic—59%, with R = 258, S’ = 18, T = 426, D = 4; 


Combination of the two technics—62%, with R = 218, S = 12, T = 344. 
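
The three percentages above can be checked from the quoted counts. The following is a minimal sketch, assuming the single-technic estimate (R - S)/(T - S), the several-technic estimate (R - S')/(T - S' - D), and the approximate standard error of formula (5); the function names are illustrative only and do not come from the paper.

```python
# Illustrative helpers (names are assumptions, not from the paper).

def efficiency_single(R, S, T):
    """Single technic (or a combination treated as one): (R - S) / (T - S)."""
    return (R - S) / (T - S)


def efficiency_several(R, S_prime, T, D):
    """One technic among several: (R - S') / (T - S' - D)."""
    return (R - S_prime) / (T - S_prime - D)


def se_efficiency(E, B):
    """Approximate standard error of E, treated as a proportion on B trials."""
    return (E * (1.0 - E) / B) ** 0.5


# Counts quoted in the text for the detection of amebiasis.
direct_smear = efficiency_several(R=97, S_prime=7, T=344, D=5)
zinc_sulfate = efficiency_several(R=258, S_prime=18, T=426, D=4)
combination = efficiency_single(R=218, S=12, T=344)

# B is the denominator of each estimate.
se_direct = se_efficiency(direct_smear, 344 - 7 - 5)
se_zinc = se_efficiency(zinc_sulfate, 426 - 18 - 4)
se_combo = se_efficiency(combination, 344 - 12)

for name, e, se in [("direct smear", direct_smear, se_direct),
                    ("zinc sulfate", zinc_sulfate, se_zinc),
                    ("combination", combination, se_combo)]:
    print(f"{name}: E = {100 * e:.1f}%  (approx. SE {100 * se:.1f}%)")
```

Rounded to whole percentages, the three estimates agree with the 27%, 59% and 62% reported in the text.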


Estimates of prevalence and of the efficiencies of the technics for 
detecting other parasitic infections are also reported. Also, the recom- 
mendation is made that, in examining for amebiasis, repeated examina- 
tions by the zinc sulfate technic on different stools be used in preference 
to using various technics on a single stool. The results given in [5] were 




246 BIOMETRICS, SEPTEMBER 1951 


further used by Tobie et al. [6] in planning a study in the chemotherapy 
of amebiasis. 


REFERENCES 


1. Sawitz, W. and Karpinos, B. D.: Statistical problems involved in the application 
of the N.I.H. swab for the diagnosis of oxyuriasis. Amer. Jour. Hyg. 35, 15-26, 
1942. 

2. Sawitz, W. G. and Faust, E. C.: The probability of detecting intestinal protozoa 
by successive stool examinations. Am. Jour. Trop. Med. 22, 131-136, 1942. 

3. Faust, E. C.; Sawitz, W.; Tobie, J.; Odom, V.; Peres, C.; and Lincicome, D. R.: 
Comparative efficiency of various technics for the diagnosis of protozoa and 
helminths in feces. Jour. of Parasitology, 25, 241-262, 1939. 

4. Neyman, J.: Outline of statistical treatment of the problem of diagnosis. Public 
Health Reports, Vol. 62, No. 40, 1449-1456, 1947. 

5. Tobie, J. E.; Reardon, L. V.; Bozicevich, J.; Shih, B.-C.; Mantel, N.; and Thomas, 
E. H.: The efficiency of the zinc sulfate technic in the detection of intestinal pro- 
tozoa by successive stool examinations. Am. Jour. Trop. Med. (in press). 

6. Tobie, J. E.; Most, H.; Reardon, L. V.; and Bozicevich, J.: Laboratory results 
on the efficacy of terramycin, aureomycin and bacitracin in the treatment of 
asymptomatic amebiasis. Am. Jour. Trop. Med. 31, 414-419, 1951. 


ASYMPTOTIC REGRESSION 
W. L. STEVENS 


Faculdade de Ciências Econômicas e Administrativas, 
Universidade de São Paulo, Brazil 


SUMMARY 


THE REGRESSION equation, $y = \alpha + \beta\rho^{x}$, is very useful for representing 
the relation between y and x, when y tends asymptotically to a limit as 
x tends to infinity. Examples of such a relation are found in the growth of 
an animal approaching maturity and in the yield of a crop as function 
of quantity of fertiliser. By simple transformation, other asymptotic 
regression formulae, such as Gompertz’s and the logistic, can be brought 
to the above form. 

The arithmetic labour of making a least-squares adjustment has 
proved so great that few research-workers appear to have attempted it, 
with a consequence that unsatisfactory and very inefficient methods 
have generally been adopted instead. It is therefore of immediate prac- 
tical value to rationalise and reduce the arithmetic labour of finding 
efficient estimates of the three parameters. It is shown that the covari- 
ance matrix, which is used in Fisher’s process of estimation, is effectively 
a function of $\rho$ only. Tables are provided of the terms of this matrix 
for five, six and seven equally spaced values of x, these being the numbers 
of levels of x which will prove most useful in experimentation. Worked 
examples are given to show how, with the help of these tables, the 
arithmetic labour need no longer be considered an obstacle. 


1. INTRODUCTION 


Although the field of application of polynomial regression formulae 
is extremely wide, there yet exists a large number of regression 
problems for which they are unsuitable, either because the polynomial curve, 
in fact, does not provide an adequate graduation of the data or be- 
cause the curve is capable of taking a form which must be rejected 
intuitively. Perhaps the most frequently encountered type of problem 
for which a polynomial regression is clearly unsuitable is one in which 
the value of the dependent variable, y, steadily approaches an unknown 
asymptotic value, as x passes to infinity. 

Particularizing again, we suggest that, to deal with this type of prob- 
lem, the simplest and most useful regression formula is 


$$ y = \alpha + \beta\rho^{x} \qquad (1.1) $$


where 
$$ 0 < \rho < 1 $$


The equation contains three parameters, $\alpha$ representing the asymptotic 
value of y, $\beta$ representing the change in y when x passes from 0 to $+\infty$ 
and $\rho$ representing the factor by which the deviation of y from its 
asymptotic value is reduced every time we take a unit step along the axis of x. 
This regression curve is of course the familiar exponential curve, trans- 
lated and perhaps inverted, as can be seen if we write the equation in 
the equivalent form 


$$ y = \alpha + \beta \exp(-\gamma x) $$
where 


$$ \gamma = -\log \rho $$


and is therefore positive. 

The equation is one which we meet repeatedly in every branch of 
science. It is, for example, Newton’s Law of Cooling, showing the 
temperature (y) of a cooling body as a function of time (x), where $\alpha$ 
represents the room temperature. 

The equation has been used successfully to represent the relation 
between the yield of a crop (y) and the rate of application of a fertiliser 
(x). It is equally useful for representing the growth of an organism 
when it is approaching maturity. When used for describing the response 
to a fertiliser, it is often known as Mitscherlich’s law, though it could be 
said that this law asserts not merely the form of the equation but also the 
constancy of the parameter $\rho$ for a given fertiliser over a wide range of 
conditions. 


When used to express the response to a fertiliser, the equation is more 
often found in the forms 


$$ y = A\{1 - 10^{-c(x + b)}\} $$


or $$ y = A\{1 - \exp[-c(x + b)]\} $$


which imply of course that $\beta$ is negative and that the curve rises. 

The utility of the basic equation (1.1) may be greatly extended if we 
are permitted to make a few simple transformations of y. Taking the 
exponential function of y, we have 


$$ z = \exp(y) = \exp(\alpha + \beta\rho^{x}) \qquad (1.3) $$


which is Gompertz’s law, used by actuaries for graduating life tables 
and by economists for predicting price changes. Again, if we take the 
reciprocal of y, we get 


$$ z = \frac{1}{y} = \frac{1}{\alpha + \beta\rho^{x}} $$
which is the equation of the logistic curve, used by demographers to 
describe population growth. 

Methods of estimation given in textbooks are often very crude: a 
favourite one is to subtotal the expected and observed values of y in three 
equal consecutive groups (throwing away one or two observations if 
their number is not divisible by three) and solving the resulting three 
equations for the three parameters. These estimates are evidently con- 
sistent but, as we shall see later, they may be very inefficient. 

An alternative method, also very popular, is to guess a value for $\rho$ 
or to use one based on previous experience, and thus to reduce the 
problem to one of linear regression. This procedure is peculiarly dangerous 
if the investigator uses a constant value of $\rho$ in analysing a series of 
experiments and then subsequently asserts that the constancy of $\rho$ is a 
fact revealed by these same experiments. 

Yet a third popular method is to guess the asymptotic value by 
inspecting the results or their graph and then to transform the equation 
to 


$$ z = \log(\alpha - y) = \log(-\beta) + x \log \rho \qquad (1.4) $$


thus again reducing the problem to one of linear regression. 
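
The third method can be sketched in a few lines. This is only an illustration of the log transformation described above, not a recommended procedure: the data are artificial values generated from $y = 160 - 100(0.7)^{x}$, and the function name and both guessed asymptotes are assumptions made for the example.

```python
import math


def fit_with_guessed_asymptote(xs, ys, alpha_guess):
    """Regress z = log(alpha_guess - y) on x by ordinary least squares.

    Returns (beta, rho), using z = log(-beta) + x * log(rho).
    Assumes a rising curve, so that alpha_guess - y > 0 for every y.
    """
    zs = [math.log(alpha_guess - y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return -math.exp(intercept), math.exp(slope)


# Artificial, error-free data (hypothetical example values).
xs = list(range(6))
ys = [160.0 - 100.0 * 0.7 ** x for x in xs]

# With the asymptote guessed correctly the parameters are recovered.
beta_hat, rho_hat = fit_with_guessed_asymptote(xs, ys, 160.0)

# A guess that is too low distorts the estimate of rho.
beta_bad, rho_bad = fit_with_guessed_asymptote(xs, ys, 150.0)

print(beta_hat, rho_hat)
print(beta_bad, rho_bad)
```

Even with data free of error, the estimate of $\rho$ depends on the guessed asymptote, which is why the guess cannot be treated as harmless.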

Frederico Pimentel Gomes and Euripedes Malavolta have shown 
how to derive from the normal equations an equation in r only. The 
arithmetic labour for their method appears to be heavy. 

H. O. Hartley has arrived at a method of solution by showing that y 
can be expressed as a bilinear regression on its own partial sums and x. 
The estimates are not however efficient, unless we adopt about the 
errors of y an hypothesis which conflicts with the null hypothesis tested 
in an analysis of variance. 


2. ESTIMATION 


We shall assume that all observations may be given the same weight. 
There is no mathematical difficulty in writing down the normal equations 
to give the least-square or maximal likelihood estimates. Since these 
equations yield estimates, it will be appropriate to replace $\alpha$, $\beta$ and $\rho$ by 
a, b and r respectively. 


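$$
\begin{aligned}
-an - bS(r^{x}) + Y &= 0 \\
-aS(r^{x}) - bS(r^{2x}) + Y_1 &= 0 \\
-abS(xr^{x-1}) - b^{2}S(xr^{2x-1}) + bY_2 &= 0
\end{aligned}
\qquad (2.1)
$$


where 

$$
\begin{aligned}
n &= \text{number of observations} \\
S(\cdots) &= \text{summation over the } n \text{ values of } x \text{ or } y \\
Y &= S(y) \\
Y_1 &= S(r^{x} y) \\
Y_2 &= S(xr^{x-1} y)
\end{aligned}
\qquad (2.2)
$$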

The information matrix is found by differentiating the expressions in the 
normal equations with respect to a, b and r in turn, putting each y equal 
to its expected value and changing all signs: 


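$$
\{I\} =
\begin{pmatrix}
n & S(r^{x}) & bS(xr^{x-1}) \\
S(r^{x}) & S(r^{2x}) & bS(xr^{2x-1}) \\
bS(xr^{x-1}) & bS(xr^{2x-1}) & b^{2}S(x^{2}r^{2x-2})
\end{pmatrix}
\qquad (2.3)
$$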


It will be noted that the first column of terms in the block of normal 
equations is the first column of terms in the information matrix multi- 
plied by (—a). Similarly the second column is the second column of the 
information matrix multiplied by (—b). 

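On inverting the information matrix, we find that the covariance 
matrix has the form 

$$
\begin{pmatrix}
F_{aa} & F_{ab} & F_{ar}/b \\
F_{ab} & F_{bb} & F_{br}/b \\
F_{ar}/b & F_{br}/b & F_{rr}/b^{2}
\end{pmatrix}
\qquad (2.4)
$$

where $F_{aa}$, $F_{ab}$, $F_{ar}$, $F_{bb}$, $F_{br}$ and $F_{rr}$ are functions only of r, being in fact the 
components of the reciprocal of the matrix: 

$$
\begin{pmatrix}
n & S(r^{x}) & S(xr^{x-1}) \\
S(r^{x}) & S(r^{2x}) & S(xr^{2x-1}) \\
S(xr^{x-1}) & S(xr^{2x-1}) & S(x^{2}r^{2x-2})
\end{pmatrix}
\qquad (2.5)
$$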
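
For readers without the printed tables, the F components can be computed directly by building the matrix (2.5) for equally spaced x = 0, 1, ..., n - 1 and inverting it. The sketch below is a plain-Python illustration; the function names are assumptions, and the 3 x 3 inverse is obtained by the ordinary adjugate formula rather than by any method described in the paper.

```python
def moment_matrix(r, n):
    """Matrix (2.5) for x = 0, 1, ..., n - 1 (terms with x = 0 vanish)."""
    s_rx = sum(r ** x for x in range(n))
    s_r2x = sum(r ** (2 * x) for x in range(n))
    s_xr = sum(x * r ** (x - 1) for x in range(1, n))
    s_xr2 = sum(x * r ** (2 * x - 1) for x in range(1, n))
    s_x2r2 = sum(x * x * r ** (2 * x - 2) for x in range(1, n))
    return [[n, s_rx, s_xr],
            [s_rx, s_r2x, s_xr2],
            [s_xr, s_xr2, s_x2r2]]


def inverse_3x3(m):
    """Inverse of a 3 x 3 matrix by the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


# Rows and columns are ordered (a, b, r), so F[0][0] is F_aa, F[0][1] is F_ab, ...
F = inverse_3x3(moment_matrix(0.55, 6))
for row in F:
    print(["%.5f" % value for value in row])
```

For n = 6 and r = 0.55 the result should agree closely with the corresponding row of the printed table.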


Following Fisher’s general method, we may start from preliminary 
inefficient estimates a’, b’ and r’ and insert these in the left-hand-side of 
the normal equations which, instead of yielding exactly zero, now take 
the small values A, B and R respectively. Efficient estimates, a, b and r, 
are now found by adding to the preliminary estimates, the respective 
increments, $\delta a$, $\delta b$ and $\delta r$, where 

$$
\begin{aligned}
\delta a &= F_{aa}A + F_{ab}B + F_{ar}R/b \\
\delta b &= F_{ab}A + F_{bb}B + F_{br}R/b \\
b\,\delta r &= F_{ar}A + F_{br}B + F_{rr}R/b
\end{aligned}
\qquad (2.6)
$$


In consequence of the relations, noted above, between columns of the 
block of normal equations and columns of the information matrix, the 
expressions for the increments simplify to 


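$$
\begin{aligned}
\delta a &= -a' + F_{aa}Y + F_{ab}Y_1 + F_{ar}Y_2 \\
\delta b &= -b' + F_{ab}Y + F_{bb}Y_1 + F_{br}Y_2 \\
b\,\delta r &= F_{ar}Y + F_{br}Y_1 + F_{rr}Y_2
\end{aligned}
\qquad (2.7)
$$

Hence efficient estimates are 

$$
\begin{aligned}
a &= a' + \delta a = F_{aa}Y + F_{ab}Y_1 + F_{ar}Y_2 \\
b &= b' + \delta b = F_{ab}Y + F_{bb}Y_1 + F_{br}Y_2 \\
r &= r' + \delta r
\end{aligned}
$$

where 
$$ \delta r = (F_{ar}Y + F_{br}Y_1 + F_{rr}Y_2)/b \qquad (2.8) $$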


We now observe the curious fact that the preliminary estimates of 
$\alpha$ and $\beta$ have fallen out of the equations. This means that we need find 
a preliminary estimate only for $\rho$: the estimates of the other two 
parameters are then given explicitly in terms of functions which depend 
only on the preliminary estimate of $\rho$ and, of course, on the observations. 
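
The cancellation can be seen in one line of matrix algebra. As a sketch, write M for the matrix (2.5) evaluated at the preliminary r’ and F for its reciprocal; the symbol M is introduced here only for this illustration. The left-hand sides of the normal equations, with the third divided by b’, are the vector (Y, Y_1, Y_2) less M applied to (a’, b’, 0), so that multiplying by F leaves the preliminary a’ and b’ standing alone:

```latex
\begin{aligned}
\begin{pmatrix} A \\ B \\ R/b' \end{pmatrix}
  &= \begin{pmatrix} Y \\ Y_1 \\ Y_2 \end{pmatrix}
   - M \begin{pmatrix} a' \\ b' \\ 0 \end{pmatrix}, \\[4pt]
\begin{pmatrix} \delta a \\ \delta b \\ b\,\delta r \end{pmatrix}
  &= F \begin{pmatrix} A \\ B \\ R/b' \end{pmatrix}
   = F \begin{pmatrix} Y \\ Y_1 \\ Y_2 \end{pmatrix}
   - \begin{pmatrix} a' \\ b' \\ 0 \end{pmatrix},
  \qquad F = M^{-1}.
\end{aligned}
```

Adding (a’, b’, 0) to the first two increments therefore removes every trace of the preliminary a’ and b’, leaving only quantities that depend on r’ and the observations.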

If the values of x are equally spaced (or spaced according to some 
regular pattern) it becomes practicable to tabulate the covariance 
matrix and thus to eliminate a large portion of the arithmetical labour 
usually required. In the general problem of adjusting a regression 
formula containing three parameters, this would be impracticable be- 
cause a table of triple entry would be needed. 

Finally we have to determine the sum of squares of deviations from 
the fitted regression curve: 


$$ S(y - y_e)^{2} $$
Since the parameters have been efficiently estimated, this will be equal to 


$$ S(y^{2}) - S(y_e^{2}) $$
where 
$$ S(y_e^{2}) = aY + bY_1 + (b\,\delta r)Y_2 \qquad (2.9) $$


Caution is needed in using this method when the deviations from the 
regression curve are expected to be very small, because the residual sum 
of squares then appears as a small difference between two large 
quantities and a relatively small error in $S(y_e^{2})$ will produce a relatively large 
error in $S(y - y_e)^{2}$. In such cases it is advisable to calculate each 
expected value, $y_e$, and thence to obtain the residuals and the sum of 
their squares. 

It is a good practice to modify the values of y by subtracting from 
them a convenient constant lying somewhere near the middle of their 
range. The resultant gain in accuracy of the estimates is slight but the 
gain in accuracy of $S(y - y_e)^{2}$, when found by the short method, 
may be considerable. 


3. EQUALLY SPACED ORDINATES 


Although the literature is full of experiments in which the author has 
used an arbitrary series of values of x, we must suppose that, until a 
cogent reason has been offered to the contrary, the most simple, natural 
and useful series is one in which the values of x advance in equal steps. 

If there are n values of x, we may, by a simple change of units and 
origin, designate these as 


$$ 0, \; 1, \; 2, \; \ldots, \; (n - 1) $$


It remains for us to decide on a suitable value for n. Clearly three is 
insufficient, since in such an experiment a perfect fit would be guaran- 
teed. Also four is hardly enough, since it would leave only one degree of 
freedom for testing deviation from the proposed regression formula. On 
the other hand, it will not usually be necessary to go much beyond seven. 

We conclude that the most useful numbers of levels, for experimental 
purposes, are five, six and seven. We therefore provide tables, for n = 5, 
6 and 7, of the components of the F matrix. The argument, r, is given 
in steps of 0.01, except in the higher values where the steps are of 0.005. 
The preliminary estimate is always rounded to the nearest tabulated value, 
so that interpolation in the table is unnecessary. Two other tables, 
over the same values of the argument, will be required, one of $r^{x}$ and the 
other of $xr^{x-1}$, but these are not provided here as they can be readily 
prepared from a table of powers. Tables of $S(r^{x})$ and $S(xr^{x-1})$ are also 
useful for checking the contents of the multiplier register when forming 
$Y_1$ and $Y_2$. 

If the preliminary value of r falls beyond the upper limit of tabulation, 
it is a warning against attempting to fit an asymptotic regression formula. 
A parabola of the second degree (which has the same number of param- 
eters) will probably provide as close a fit with less trouble. In illustration 
of this remark, we give two series of six values, the first of which follows 
an asymptotic formula with $\rho$ = 0.7 and the second a polynomial of the 
second degree 


asymptotic:   60.0    90.0   111.0   125.7   136.0   143.2 


parabola:     61.0    88.5   110.3   126.6   137.2   142.3 


It would need a fantastically extensive fertiliser trial to discriminate 
between two such series. 

Many research workers will however argue in favour of using an 
asymptotic formula, in a case like this one, on the grounds that it will 
give plausible results when extrapolated for high values of x. But such 
extrapolations are extremely dangerous. Since the data themselves 
(in an example like this) are quite insufficient to establish the form of the 
regression equation, it is necessary to have established the absolute 
generality of the preferred formula by extensive experimentation or 
observation, backed, if possible, by theoretical justifications. This is 

TABLE FOR CALCULATING THE COVARIANCE MATRIX 

n = 5 

  r        Faa        Fab        Fbb        Far        Fbr        Frr 
           (+)        (-)        (+)        (-)        (+)        (+) 

0.25     0.57704   -0.54370    1.50839   -0.66341    0.40529    1.58696 
 .26      .59753     .56058    1.52135     .68654     .41796    1.59977 
 .27      .61960     .57874    1.53528     .71097     .43190    1.61315 
 .28      .64339     .59831    1.55028     .73681     .44721    1.62717 
 .29      .66907     .61944    1.56647     .76414     .46399    1.64186 

0.30     0.69681   -0.64228    1.58398   -0.79310    0.48233    1.65730 
 .31      .72683     .66700    1.60296     .82380     .50237    1.67352 
 .32      .75934     .69382    1.62359     .85637     .52422    1.69061 
 .33      .79459     .72294    1.64606     .89097     .54802    1.70862 
 .34      .83289     .75463    1.67058     .92776     .57393    1.72763 

0.35     0.87452   -0.78916    1.69740   -0.96691    0.60211    1.74769 
 .36     0.91986     .82684    1.72680    1.00861     .63274    1.76890 
 .37     0.96930     .86804    1.75908    1.05308     .66603    1.79132 
 .38     1.02329     .91316    1.79462    1.10055     .70220    1.81506 
 .39     1.08233    0.96265    1.83380    1.15128     .74148    1.84019 

0.40     1.14699   -1.01702    1.87710   -1.20556    0.78416    1.86681 
 .41     1.21793    1.07687    1.92504    1.26369     .83053    1.89504 
 .42     1.29586    1.14286    1.97823    1.32602     .88092    1.92498 
 .43     1.38162    1.21575    2.03734    1.39295     .93570    1.95676 
 .44     1.47615    1.29641    2.10318    1.46488    0.99528    1.99050 

0.45     1.5805    -1.3858     2.1767    -1.5423     1.0601     2.0263 
 .46     1.6960     1.4852     2.2588     1.6258     1.1307     2.0645 
 .47     1.8240     1.5957     2.3508     1.7158     1.2077     2.1050 
 .48     1.9661     1.7190     2.4541     1.8132     1.2916     2.1481 
 .49     2.1243     1.8567     2.5703     1.9186     1.3832     2.1941 

0.50     2.3005    -2.0109     2.7013    -2.0328     1.4833     2.2431 
 .51     2.4975     2.1840     2.8493     2.1568     1.5929     2.2953 
 .52     2.7180     2.3786     3.0168     2.2917     1.7129     2.3511 
 .53     2.9655     2.5980     3.2068     2.4387     1.8445     2.4107 
 .54     3.2439     2.8459     3.4229     2.5991     1.9891     2.4744 

0.55     3.5579    -3.1267     3.6691    -2.7745     2.1482     2.5426 
 .56     3.9128     3.4456     3.9505     2.9667     2.3235     2.6157 
 .57     4.3152     3.8086     4.2728     3.1777     2.5170     2.6942 
 .58     4.7725     4.2231     4.6428     3.4099     2.7310     2.7784 
 .59     5.2940     4.6976     5.0690     3.6659     2.9681     2.8690 

0.60     5.8902    -5.2426     5.5612    -3.9488     3.2314     2.9666 
 .61     6.5742     5.8704     6.1312     4.2623     3.5245     3.0719 
 .62     7.3615     6.5960     6.7935     4.6106     3.8514     3.1857 
 .63     8.2709     7.4375     7.5655     4.9986     4.2170     3.3087 
 .64     9.3251     8.4169     8.4686     5.4320     4.6270     3.4422 

0.65    10.5519    -9.…        9.5287    -5.9176     5.0880     3.5871 
 .655   11.2405    10.2054    10.1275     6.1824     5.3400     3.6644 
 .66    11.9856    10.9037    10.7781     6.4635     5.6080     3.7450 
 .665   12.7928    11.6615    11.4858     6.7619     5.8929     3.8292 
 .67    13.6681    12.4850    12.2564     7.0792     6.1964     3.9172 

0.675   14.6186   -13.3807    13.0965    -7.4167     6.5197     4.0092 
 .68    15.6518    14.3562    14.0133     7.7761     6.8645     4.1056 
 .685   16.7764    15.4198    15.0151     8.1593     7.2326     4.2064 
 .69    18.0020    16.5809    16.1109     8.5681     7.6259     4.3121 
 .695   19.3393    17.8501    17.3111     9.0047     8.0466     4.4230 

0.70    20.8006   -19.…       18.6273    -9.4716     8.4970     4.5393 

TABLE FOR CALCULATING THE COVARIANCE MATRIX—continued 

n = 6 

  r        Faa        Fab        Fbb        Far        Fbr        Frr 
           (+)        (-)        (+)        (-)        (+)        (+) 

0.25     0.37223   -0.34952    1.32429   -0.43372    0.18752    1.32937 
 .26      .38180     .35671    1.32870     .44545     .19013    1.33034 
 .27      .39204     .36438    1.33337     .45776     .19339    1.33140 
 .28      .40300     .37259    1.33834     .47070     .19734    1.33258 
 .29      .41477     .38139    1.34364     .48431     .20204    1.33392 

0.30     0.42741   -0.39084    1.34932   -0.49864    0.20752    1.33545 
 .31      .44100     .40101    1.35542     .51375     .21383    1.33720 
 .32      .45563     .41196    1.36201     .52970     .22103    1.33922 
 .33      .47142     .42379    1.36915     .54655     .22920    1.34156 
 .34      .48847     …         1.37690     .56438     .23838    1.34424 

0.35     0.50690   -0.45046    1.38535   -0.58326    0.24864    1.34732 
 .36      .52687     .46552    1.39459     .60328     .26007    1.35083 
 .37      .54853     .48191    1.40473     .62452     .27275    1.35483 
 .38      .57206     .49977    1.41589     .64710     .28677    1.35937 
 .39      .59766     .51928    1.42822     .67112     .30223    1.36448 

0.40     0.62556   -0.54062    1.44185   -0.69670    0.31926    1.37023 
 .41      .65601     .56403    1.45699     .72398     .33796    1.37667 
 .42      .68931     .58975    1.47384     .75312     .35848    1.38385 
 .43      .72578     .61805    1.49263     .78427     .38098    1.39184 
 .44      .76580     .64928    1.51364     .81761     .40561    1.40070 

0.45     0.80979   -0.68379    1.53720   -0.85335    0.43258    1.41050 
 .46     0.85823     .72202    1.56366    0.89172     .46208    1.42131 
 .47     0.91169     .76445    1.59345    0.93297     .49436    1.43321 
 .48     0.97079     .81165    1.62707    0.97736     .52968    1.44628 
 .49     1.03627     .86428    1.66509    1.02523     .56834    1.46061 

0.50     1.10897   -0.92308    1.70818   -1.07692    0.61066    1.47629 
 .51     1.18987    0.98893    1.75711    1.13283     .65703    1.49344 
 .52     1.28010    1.06286    1.81280    1.19340     .70786    1.51217 
 .53     1.38097    1.14606    1.87632    1.25913     .76364    1.53260 
 .54     1.49401    1.23992    1.94893    1.33059     .82491    1.55487 

0.55     1.62101   -1.34608    2.03212   -1.40843    0.89229    1.57912 
 .56     1.76408    1.46648    2.12766    1.49339    0.96650    1.60553 
 .57     1.92570    1.60341    2.23765    1.58631    1.04833    1.63428 
 .58     2.10879    1.75957    2.36457    1.68814    1.13871    1.66557 
 .59     2.31683    1.93820    2.51142    1.80001    1.23873    1.69964 

0.60     2.5540    -2.1432     2.6818    -1.9232     1.3496     1.7367 
 .61     2.8251     2.3791     2.8800     2.0591     1.4728     … 
 .62     3.1363     2.6516     3.1113     2.2096     1.6099     1.8212 
 .63     3.4947     2.9675     3.3820     2.3765     1.7630     1.8692 
 .64     3.9089     3.3349     3.7000     2.5623     1.9343     1.9217 

0.65     4.3897    -3.7640     4.0747    -2.7696     2.1264     1.9790 
 .655    4.6589     4.0054     4.2869     2.8824     2.2313     2.0097 
 .66     4.9500     4.2671     4.5179     3.0018     2.3426     2.0418 
 .665    5.2648     4.5511     4.7697     3.1284     2.4609     2.0754 
 .67     5.6058     4.8596     5.0443     3.2627     2.5867     2.1106 

0.675    5.9756    -5.1951     5.3441    -3.4053     2.7205     2.1475 
 .68     6.3771     5.5605     5.6719     3.5569     2.8631     2.1861 
 .685    6.8136     5.9588     6.0306     3.7181     3.0151     2.2267 
 .69     7.2887     6.3936     6.4236     3.8899     3.1773     2.2693 
 .695    7.8066     6.8688     6.8548     4.0729     3.3506     2.3140 

0.70     8.3718    -7.3890     7.3284    -4.2683     3.5359     2.3609 

TABLE FOR CALCULATING THE COVARIANCE MATRIX—continued 

n = 7 

  r        Faa        Fab        Fbb        Far        Fbr        Frr 
           (+)        (-)        (+)        (-)        (+)        (+) 

0.30     0.30262   -0.27587    1.24338   -0.35626      …        1.17301 
 .31      .30996     .28080    1.24515     .36488      …        1.16807 
 .32      .31783     .28609    1.24703     .37393      …        1.16314 
 .33      .32627     .29176    1.24905     .38343      …        1.15824 
 .34      .33534     .29786    1.25122     .39343      …        1.15341 

0.35     0.34509   -0.30444    1.25359   -0.40397      …        1.14868 
 .36      .35559     .31155    1.25618     .41510      …        1.14409 
 .37      .36693     .31924    1.25903     .42685      …        1.13967 
 .38      .37918     .32759    1.26219     .43928      …        1.13545 
 .39      .39245     .33666    1.26571     .45245      …        1.13148 

0.40     0.40684   -0.34655    1.26966   -0.46642      …        1.12779 
 .41      .42246     .35735    1.27410     .48126      …        1.12442 
 .42      .43947     .36918    1.27911     .49705      …        1.12140 
 .43      .45800     .38215    1.28480     .51386      …        1.11879 
 .44      .47825     .39640    1.29126     .53180      …        1.11662 

0.45     0.50040   -0.41212    1.29864   -0.55096      …        1.11494 
 .46      .52469     .42947    1.30707     .57145      …        1.11378 
 .47      .55137     .44868    1.31672     .59341      …        1.11321 
 .48      .58075     .47000    1.32781     .61696      …        1.11326 
 .49      .61315     .49371    1.34056     .64228      …        1.11400 

0.50     0.64898   -0.52015    1.35524   -0.66952      …        1.11547 
 .51      .68869     .54971    1.37219     .69889      …        1.11773 
 .52      .73279     .58283    1.39177     .73061      …        1.12086 
 .53      .78190     .62004    1.41445     .76493      …        1.12491 
 .54      .83673     .66196    1.44073     .80212      …        1.12996 

0.55     0.89809   -0.70931    1.47125   -0.84250      …        1.13608 
 .56     0.96696     .76295    1.50674     .88643      …        1.14336 
 .57     1.04446     .82389    1.54810     .93432      …        1.15190 
 .58     1.13196     .89332    1.59638    0.98665      …        1.16180 
 .59     1.23103    0.97267    1.65283    1.04394      …        1.17317 

0.60     1.34357   -1.06365    1.71900   -1.10682      …        1.18614 
 .61     1.47183    1.16831    1.79672    1.17600      …        1.20085 
 .62     1.61853    1.28913    1.88822    1.25231      …        1.21745 
 .63     1.78695    1.42909    1.99624    1.33671      …        1.23613 
 .64     1.98103    1.59184    2.12410    1.43033      …        1.25708 

0.65     2.2056    -1.7819     2.2759    -1.5345       …        1.2805 
 .66     2.4665     2.0046     2.4567     1.6507       …        1.3067 
 .67     2.7711     2.2668     2.6728     1.7809       …        1.3360 
 .68     3.1284     2.5770     2.9321     1.9272       …        1.3686 
 .69     3.5495     2.9457     3.2444     2.0922       …        1.4051 

0.70     4.0485    -3.3862     3.6223    -2.2792       …        1.4457 
 .705    4.3326     3.6384     3.8407     2.3820       …        1.4678 
 .71     4.6432     3.9152     4.0817     2.4919       …        1.4912 
 .715    4.9833     4.2195     4.3481     2.6093       …        1.5159 
 .72     5.3563     4.5545     4.6431     2.7349       …        1.5420 

0.725    5.7660    -4.9238     4.9699    -2.8696       …        1.5697 
 .73     6.2168     5.3317     5.3328     3.0141       …        1.5990 
 .735    6.7137     5.7829     5.7363     3.1693       …        1.6301 
 .74     7.2624     6.2829     6.1855     3.3364       …        1.6630 
 .745    7.8694     6.8381     6.6867     3.5164       …        1.6980 

0.75     8.5423    -7.4556     7.2466    -3.7107       …        1.7351 

feasible in physics (e.g. Newton’s Law of Cooling), difficult enough in 
biology and agriculture and extraordinarily difficult in demography. 
Even granted however that we may accept the proposed asymptotic 
formula, the standard errors of extrapolated values can be enormous. 
This will be seen from the first column of the table, showing the variance 
of the estimated asymptotic value of y. It will be observed that as r 
approaches the highest value tabulated, the variance begins to diverge 
sharply to very high values. 

The most reckless extrapolators are the demographers, who, provided 
with three or four census results, will undertake to predict, not only the 
course of the population over the next century, but also the asymptotic 
value towards which it is supposed to be tending. We believe that 
many, if not most, of these long range forecasts are of interest only as 
arithmetical exercises. 

On the other hand, if the value of r is smaller than the smallest value 
tabulated, the dependent variable will have practically reached its 
asymptotic value at x = 2. This implies that the rising (or falling) 
portion of the curve is not very well determined, although it is often this 
portion which has greatest practical interest as, for example, in discover- 
ing the most economic rate of application of a fertiliser. Generally, if 
the levels of x are well chosen, if necessary by means of a preliminary 
experiment, the value of r will fall within the limits of tabulation. 

The rapidity with which the tables supply efficient estimates of the 
parameters will be illustrated by an example: A thermometer, lowered into 
a refrigerated hold, gave the following six consecutive readings (°F) at half- 
minute intervals: 


57.5 45.7 38.7 35.3 33.1 32.2 


Estimate the temperature of the hold and the standard error of this estimate 
and calculate a pair of fiducial limits. 

It may be expected that the errors in the readings will be very small. 
This indicates that it will be inadvisable to rely on the short method (2.9) 
for calculating the residual sum of squares. 

The preliminary estimate may be found from the formula 

\[
r = \frac{y(1) - y(5)}{y(0) - y(4)} = \frac{45.7 - 32.2}{57.5 - 33.1} = 0.55
\]

Values of (0.55)^x and x(0.55)^(x-1) are copied from the tables (not provided here). 


x     r^x        x r^(x-1)    y 
0     1.00000    0            57.5 
1     0.55000    1.00000      45.7 
2     0.30250    1.10000      38.7 
3     0.16638    0.90750      35.3 
4     0.09151    0.66550      33.1 
5     0.05033    0.45753      32.2 


Total, Y = 242.5 


Multiplying the columns, we have 

\[
Y_1 = S(r^x y) = 104.86457, \qquad Y_2 = S(x r^{x-1} y) = 157.06527
\]

The {F} matrix is now copied from the table, with n = 6 and r = 0.55. 

\[
\begin{pmatrix}
 1.62101 & -1.34608 & -1.40843 \\
-1.34608 &  2.03212 &  0.89229 \\
-1.40843 &  0.89229 &  1.57912
\end{pmatrix}
\begin{pmatrix} 242.5 \\ 104.86457 \\ 157.06527 \end{pmatrix}
\]

Multiplying, in turn, each column of the matrix by the column on the 
right we find 

\[
\begin{aligned}
a &= 30.723 \\
b &= 26.821 \\
b(\delta r) &= +0.05024 \\
\delta r &= +0.00187 \\
r &= 0.55187
\end{aligned}
\]
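
The correction step just carried out can be sketched in Python. This is only an illustration under stated assumptions: instead of reading {F} from the tables, it forms and solves the normal equations for the three columns 1, r^x and x r^(x-1), which should be equivalent to multiplying (Y, Y_1, Y_2) by {F}. The function names `solve_linear` and `correction_step` are hypothetical and do not come from the paper.

```python
def solve_linear(A, v):
    """Solve A z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    M = [list(row) + [rhs] for row, rhs in zip(A, v)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(n):
            if i != col:
                factor = M[i][col] / M[col][col]
                M[i] = [mi - factor * mc for mi, mc in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def correction_step(y, r):
    """One correction step for y = a + b*r**x at x = 0, 1, ..., len(y)-1."""
    xs = range(len(y))
    # Columns of the linearised model: 1, r^x and x*r^(x-1).
    cols = [[1.0 for _ in xs],
            [r ** x for x in xs],
            [x * r ** (x - 1) for x in xs]]
    # Normal equations (assumed equivalent to using the tabulated {F}).
    A = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    v = [sum(p * yi for p, yi in zip(ci, y)) for ci in cols]
    a, b, b_dr = solve_linear(A, v)
    return a, b, r + b_dr / b


readings = [57.5, 45.7, 38.7, 35.3, 33.1, 32.2]
a, b, r = correction_step(readings, 0.55)
print(round(a, 3), round(b, 3), round(r, 5))
```

Calling `correction_step` again with the corrected r repeats the approximation, which is how a second application of the process could be mimicked when no table is at hand.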


The data may now be graduated with the formula 
y_c = 30.723 + 26.821 (0.55187)^x 


and the sum of squares of residuals calculated directly. 


x     y       y_c       y - y_c 
0     57.5    57.544    -0.044 
1     45.7    45.525    +0.175 
2     38.7    38.892    -0.192 
3     35.3    35.231    +0.069 
4     33.1    33.211    -0.111 
5     32.2    32.096    +0.104 
Total (should check to zero)    +0.001 


Hence 


\[
S(y - y_c)^2 = 0.0973
\]
degrees of freedom = n - 3 = 3 
estimated residual variance, s^2 = 0.0973/3 = 0.0324 


To find the estimated variance of the estimate of a, we multiply s^2 
by F_aa. A rough linear interpolation in the table is permissible: there 
is no justification for attempting high accuracy. Taking F_aa = 1.65, 
we have estimated variance of a, 

\[
s_a^2 = 0.0324 \times 1.65 = 0.0535
\]

estimated standard deviation, s_a = 0.23. 
We accordingly estimate the asymptotic temperature (i.e., the tem- 
perature of the hold) at 


30.72 °F ± 0.23 


Since the estimate of a is not normally distributed it is not strictly correct 
to use the distribution of Student’s ratio for finding limits, but this is the 
best we can do until the exact sampling distribution is discovered. 
With 3 degrees of freedom and level of significance = 5%, we have 
t = 3.18. Hence, with 95% fiducial probability, the true temperature 
lies within the range 


30.72 ± (3.18)(0.23) = 29.99 to 31.45 
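
The arithmetic from the residuals to the fiducial limits can be retraced with a short Python sketch. It takes the estimates a, b, r, the interpolated F_aa = 1.65 and t = 3.18 exactly as quoted in the text; the variable names are illustrative only.

```python
import math

a, b, r = 30.723, 26.821, 0.55187      # estimates quoted in the text
readings = [57.5, 45.7, 38.7, 35.3, 33.1, 32.2]

# Graduated values and residual sum of squares, calculated directly.
fitted = [a + b * r ** x for x in range(len(readings))]
residual_ss = sum((y - yc) ** 2 for y, yc in zip(readings, fitted))
dof = len(readings) - 3                # three parameters fitted
s2 = residual_ss / dof

F_aa = 1.65      # rough interpolation in the table, as quoted
t_value = 3.18   # Student's t for 3 d.f. at the 5% level, as quoted
se_a = math.sqrt(s2 * F_aa)
lower, upper = a - t_value * se_a, a + t_value * se_a
print(round(residual_ss, 4), round(se_a, 2), round(lower, 2), round(upper, 2))
```

Because the sketch carries more decimals than the hand calculation, the last digit of the limits may differ slightly from the printed 29.99 and 31.45.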


We observe that the standard error of the estimated asymptotic 
value is comparatively large (greater than the standard error of a single 
observation) even in this example where we approach relatively close to 
the asymptote. 


4. GENERAL METHOD 


When the ordinates are not equally spaced, the solution still follows 
the pattern of the example just discussed, the only difference being that 
tables are not available. It therefore becomes necessary to calculate 
the modified information matrix (2.4), using the preliminary estimate of 
ρ. Inversion of this matrix gives us the matrix {F} and the remainder 
of the solution proceeds as before. 


5. AN EXAMPLE OF A FERTILISER TRIAL 


To illustrate the process of fitting an asymptotic regression curve to 
results of a field trial, we use some data of an experiment in which lime 
was applied at five levels: 0, 2, 4, 6 and 8 tons per hectare. The design 
was a Latin Square. 

The preliminary estimate of r (0.67) happened to be rather a bad 
one, with an error of +0.09. A much better preliminary estimate would 
have been given by a free-hand graph of the data. By this method, we 
obtained r = 0.62, with an error of +0.02. The method is of course sub- 
jective and the reader should therefore try it for himself. Under the 
circumstances, it was decided to apply the process of approximation a 
second time though really, when one compares the corrections with the 
standard errors of the estimates, the second application can hardly be 
considered as essential. 

The asymptotic regression is adjusted to the treatment subtotals, 
diminished by 60. The sum of squares for regression must accordingly be 
divided by five, since each subtotal is based on five plots. The sum of 
squares for deviations from regression was obtained by subtraction and 
does not differ appreciably from the sum of squares calculated from the 
residuals. The mean square for deviations from regression is smaller 
than the error mean square (though not significantly so), showing that 
the data are adequately represented by the proposed regression curve. 

The regression may accordingly be taken as 


y = 72.43 - 28.24 (0.597)^x 


Here y is in terms of the yield of five plots and x is in units of two tons per 
hectare. A small modification would be necessary to express both vari- 
ables in more convenient units. Possibly also the experimenter would 
prefer one of the other forms of the equation. 

To find the estimated covariance matrix we divide the last row and 
column of the tabulated matrix by b (and hence the F_rr by b^2), 
afterwards multiplying each term by five times the estimated error variance 
given by the analysis of variance (five times, because the regression was 
fitted to the subtotals of five plots). 

The standard errors may seem surprisingly large, but it must be 
remembered that the estimates are very highly correlated, which means 
that the errors made in adjusting the curve are not as great as would at 
first appear. Thus, for example, the expected value of y corresponding 
to x = 0, has an estimated variance of 


31.0 + 2(—27.6) + 29.3 = 5.1 


and hence, estimate of standard error = 2.3 
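
The same calculation can be made for any level x by combining the covariance matrix with the gradient of a + b r^x. The Python sketch below assumes the covariance matrix of (a, b, r) has the values 31.0, -27.6, 29.3, 0.736, -0.760 and 0.0196 read from the estimate of the covariance matrix given with this example; `fitted_variance` is a hypothetical helper name.

```python
import math

# Estimated covariance matrix of (a, b, r), values assumed from the text.
cov = [[31.0, -27.6, 0.736],
       [-27.6, 29.3, -0.760],
       [0.736, -0.760, 0.0196]]
b_hat, r_hat = -28.24, 0.597


def fitted_variance(x):
    """Approximate variance of the fitted value a + b*r**x at level x."""
    # Gradient of a + b*r**x with respect to (a, b, r).
    g = [1.0, r_hat ** x, b_hat * x * r_hat ** (x - 1)]
    return sum(g[i] * cov[i][j] * g[j] for i in range(3) for j in range(3))


var0 = fitted_variance(0)
print(round(var0, 1), round(math.sqrt(var0), 1))
```

At x = 0 the gradient reduces to (1, 1, 0), so the sum collapses to the three terms shown in the text.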


YIELD OF WHEAT (KGS.). 25 PLOTS IN A LATIN SQUARE 


                                                          Row total 
(D) 17.1   (E) 14.4   (A) 11.1   (C) 12.7   (B)  9.7   |   65.0 
(E) 14.7   (B) 11.2   (C) 11.2   (A)  8.0   (D) 10.0   |   55.1 
(A) 10.5   (C) 13.9   (B) 11.2   (D) 13.4   (E) 14.7   |   63.7 
(C) 14.8   (D) 12.4   (E) 13.7   (B) 10.4   (A)  7.4   |   58.7 
(B) 12.1   (A)  7.4   (D) 12.8   (E) 11.4   (C) 11.2   |   54.9 
    69.2       59.3       60.0       55.9       53.0   |  297.4 


Designation:          (A)    (B)    (C)    (D)    (E) 
Tons lime/hectare:     0      2      4      6      8 
Subtotals:            44.4   54.6   63.8   65.7   68.9 


Preliminary estimate, 

\[
r = \frac{(E) - (B)}{(D) - (A)} = \frac{68.9 - 54.6}{65.7 - 44.4} = 0.670
\]


x     (0.67)^x    x(0.67)^(x-1)    y 
0     1.00000     0                -15.6 
1     0.67000     1.00000          - 5.4 
2     0.44890     1.34000          + 3.8 
3     0.30076     1.34670          + 5.7 
4     0.20151     1.20305          + 8.9 
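{F} matrix for r = 0.67 

\[
\begin{pmatrix}
+13.6681 & -12.4850 & -7.0792 \\
-12.4850 & +12.2564 & +6.1964 \\
 -7.0792 &  +6.1964 & +3.9172
\end{pmatrix}
\begin{pmatrix} -2.6 \\ -14.0044 \\ +18.0753 \end{pmatrix}
\begin{matrix} = Y \\ = Y_1 \\ = Y_2 \end{matrix}
\]

Hence: 

\[
\begin{aligned}
a &= 11.349 + 60 = 71.349 \\
b &= -27.181 \\
b(\delta r) &= +2.434 \\
\delta r &= -0.0895 \\
r &= 0.670 - 0.0895 = 0.580
\end{aligned}
\]

Starting from r = 0.58 and repeating the process, we find 

\[
\begin{aligned}
Y &= -2.6 & a &= 12.433 + 60 = 72.433 \\
Y_1 &= -15.3344 & b &= -28.244 \\
Y_2 &= +11.7064 & b(\delta r) &= -0.487 \quad (r = 0.5972)
\end{aligned}
\]

Hence, 

\[
\begin{aligned}
S(y_c^2) &= (-2.6)(12.433) + (-15.3344)(-28.244) + (11.7064)(-0.487) = 395.08 \\
\text{correction for the mean, } Y^2/5 &= 1.35 \\
S(y_c - \bar{y})^2 &= 393.73
\end{aligned}
\]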


Dividing by five, to express in terms of a single plot we have: 
Sum of squares due to regression = 393.73/5 = 78.75 
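
The partition of the treatment sum of squares can be checked with a few lines of Python. The sketch uses the subtotals and the second-approximation quantities quoted above; treating 1.35 as the correction for the mean, Y^2/5, is an assumption made in reading the calculation.

```python
subtotals = [44.4, 54.6, 63.8, 65.7, 68.9]   # treatments (A) to (E)
plots = 5                                     # plots per subtotal

grand = sum(subtotals)
treatment_ss = (sum(t * t for t in subtotals) / plots
                - grand ** 2 / (plots * len(subtotals)))

# Quantities from the second approximation (subtotals diminished by 60).
Y, Y1, Y2 = -2.6, -15.3344, 11.7064
a, b, b_dr = 12.433, -28.244, -0.487
regression_ss = (Y * a + Y1 * b + Y2 * b_dr
                 - Y ** 2 / len(subtotals)) / plots
deviation_ss = treatment_ss - regression_ss
print(round(treatment_ss, 2), round(regression_ss, 2), round(deviation_ss, 2))
```

The deviations from regression are obtained here by subtraction, as in the text.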


ANALYSIS OF VARIANCE 


                                   degrees of    sums of    mean 
Source of variation                freedom       squares    square 
Rows                                   4          17.81      4.45 
Columns                                4          29.92      7.48 
Treatments: Regression                 2          78.75     39.38 
  Deviations from regression           2           0.71      0.36 
Treatments: Total                      4          79.46 
Remainder                             12          12.64      1.053 
Total                                 24         139.83 


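ESTIMATE OF THE COVARIANCE MATRIX 


Taking r = 0.6, we find the estimated covariance matrix: 

\[
(5)(1.053)
\begin{pmatrix}
 5.89   & -5.24   &  0.1398 \\
-5.24   &  5.56   & -0.1444 \\
 0.1398 & -0.1444 & \ldots
\end{pmatrix}
=
\begin{pmatrix}
 31.0  & -27.6   &  0.736 \\
-27.6  &  29.3   & -0.760 \\
 0.736 &  -0.760 &  0.0196
\end{pmatrix}
\]

Hence 

\[
\alpha = 72.43 \pm 5.57, \qquad \beta = -28.24 \pm 5.41, \qquad \rho = 0.597 \pm 0.140
\]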


6. EFFICIENCY OF GROUPING 


We noted earlier that textbooks often recommend the method of 
estimation in which the series is divided into three equal groups (one or 
two observations being thrown away if necessary) and an asymptotic 
formula is fitted exactly to the three subtotals or group means. This may 
be regarded as the extreme example of the general method of grouping 
into n equal groups. Thus, if we have N = kn observations, we may 
form the means of n consecutive groups each of k observations, 


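\[
\bar{y}_0 = (y_0 + y_1 + \cdots + y_{k-1})/k, \qquad
\bar{y}_1 = (y_k + y_{k+1} + \cdots + y_{2k-1})/k,
\]

etc. 
If the regression equation for the individual observations is 
\[
y = \alpha + \beta\rho^X, \qquad X = 0, 1, \ldots, N-1,
\]
then the regression equation for the group means will be an equation of 
the same type, namely 
\[
\bar{y} = \alpha + \beta_k \rho_k^x, \qquad x = 0, 1, \ldots, n-1,
\]
where 
\[
\beta_k = \frac{\beta(1 - \rho^k)}{k(1 - \rho)}, \qquad \rho_k = \rho^k. \tag{6.1}
\]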


It is clear that, if we can estimate the parameters of the new equa- 
tion, we can easily pass to the estimates of parameters of the original 
equation. When n = 3, no problem of statistical estimation remains, 
since we shall have three equations for the three parameters. 
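
A small numerical check of the grouping relation may help. The Python sketch below uses hypothetical parameter values (they are not taken from the paper), assumes the grouped parameters are β(1 - ρ^k)/(k(1 - ρ)) and ρ^k, and then shows the case of three groups, where ρ^k follows exactly from the three group means.

```python
alpha, beta, rho = 30.0, 25.0, 0.8    # hypothetical parameter values
k, n = 4, 3                           # n groups of k observations each

# Error-free observations y = alpha + beta*rho**X, X = 0, ..., kn-1.
y = [alpha + beta * rho ** X for X in range(k * n)]
group_means = [sum(y[g * k:(g + 1) * k]) / k for g in range(n)]

# Parameters of the regression followed by the group means.
beta_k = beta * (1 - rho ** k) / (k * (1 - rho))
rho_k = rho ** k
predicted = [alpha + beta_k * rho_k ** x for x in range(n)]

# With three groups the three means determine rho_k, and hence rho.
rho_k_exact = ((group_means[2] - group_means[1])
               / (group_means[1] - group_means[0]))
rho_recovered = rho_k_exact ** (1.0 / k)
print(group_means, predicted, rho_recovered)
```

With real data the recovered value is only a consistent estimate, since the errors of observation do not cancel as they do in this error-free illustration.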

We shall now assess the loss of information consequent on grouping 
the data in the manner described. We assume that, after grouping, the 
estimates are obtained efficiently, which includes the special case of 
n = 3, where the consistent estimates are necessarily efficient. 

In order to permit the number of observations to tend to infinity, it 
will be convenient to replace the variable x by a variable z in the range 
0 to 1. The regression equation takes, in the limit, the form 

\[
y = \alpha + \beta\rho^z \tag{6.2}
\]
while the information matrix, per observation, tends to 

\[
\{h\} =
\begin{pmatrix}
1        & h_{ab}   & b\,h_{ar} \\
h_{ab}   & h_{bb}   & b\,h_{br} \\
b\,h_{ar} & b\,h_{br} & b^2 h_{rr}
\end{pmatrix} \tag{6.3}
\]

where the terms, in the form most convenient for machine computation, 
may be expressed: 

\[
\begin{aligned}
h_{ab} &= (r - 1)/\log r \\
h_{bb} &= (r^2 - 1)/2\log r \\
h_{ar} &= (1 - h_{ab}/r)/\log r \\
h_{br} &= (r - h_{bb}/r)/2\log r \\
h_{rr} &= (1 - 2h_{br}/r)/2\log r
\end{aligned} \tag{6.4}
\]
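
The closed forms for the limiting information terms can be compared with direct numerical integration over the range 0 to 1. The Python sketch below assumes the terms are the integrals of r^z, r^(2z), z r^(z-1), z r^(2z-1) and z^2 r^(2z-2), which is how they arise from the regression on z; the helper names and the trial value r = 0.3 are illustrative.

```python
import math


def h_terms(r):
    """Closed forms for h_ab, h_bb, h_ar, h_br, h_rr (natural logarithm)."""
    log_r = math.log(r)
    h_ab = (r - 1) / log_r
    h_bb = (r * r - 1) / (2 * log_r)
    h_ar = (1 - h_ab / r) / log_r
    h_br = (r - h_bb / r) / (2 * log_r)
    h_rr = (1 - 2 * h_br / r) / (2 * log_r)
    return h_ab, h_bb, h_ar, h_br, h_rr


def midpoint(f, steps=20000):
    """Midpoint-rule integral of f over the range 0 to 1."""
    return sum(f((i + 0.5) / steps) for i in range(steps)) / steps


r = 0.3
closed = h_terms(r)
numeric = (
    midpoint(lambda z: r ** z),
    midpoint(lambda z: r ** (2 * z)),
    midpoint(lambda z: z * r ** (z - 1)),
    midpoint(lambda z: z * r ** (2 * z - 1)),
    midpoint(lambda z: z * z * r ** (2 * z - 2)),
)
print(closed)
print(numeric)
```

Each later term reuses an earlier one, which is presumably why this form suits machine computation.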


Now suppose that the observations are divided in n equal groups. 
Then the means of these groups will follow the regression equation: 

\[
\bar{y} = \alpha + \beta_n \rho_n^x, \qquad x = 0, 1, \ldots, n-1,
\]
where 
\[
\beta_n = \frac{\beta n(\rho^{1/n} - 1)}{\log\rho}, \qquad \rho_n = \rho^{1/n}. \tag{6.5}
\]

The matrix of information about α, β_n and ρ_n, per observation, will 
be given by {I}/n where {I} is the information matrix of n observations 
given in (2.3). To form the matrix of information about α, β and ρ it 
remains only to multiply by the square of the matrix of partial 
differential coefficients: 

\[
\frac{\partial(\alpha, \beta_n, \rho_n)}{\partial(\alpha, \beta, \rho)}
\]
The information matrices for the grouped and ungrouped data respec- 
tively may then be compared and efficiency of estimation of any param- 
eter or function of the parameters directly determined. 
However, it would seem desirable to construct a single measure of 
average efficiency with respect to the simultaneous estimation of the 
three parameters. The quantity which suggests itself for this purpose is 
the cube root of the ratio of the determinants of the information matrices. 


Below are shown the efficiencies when data are grouped respectively in 
three and six groups, for typical values of ρ near the middle and ends of 
the range in which it might be useful to fit an asymptotic regression 
formula. 


                 three groups       six groups 
ρ                ρ_3       E        ρ_6      E 
0.00024414       0.0625    36%      0.25     82% 
0.015625         0.25      58%      0.5      88% 
0.117649         0.49      71%      0.7      93% 


Actually, if the number of observations is small, the loss of informa- 
tion per observation will be a little less. On the other hand, in this case, 
we would also have to take account of the information lost in rejected 
observations when their number is not exactly divisible by three or six, 
so the net loss of information may in fact be greater with small numbers 
than in the limit. 

It appears then that many statistical textbooks in current use 
recommend a method which is equivalent to throwing away from 30 to 
60% of the data! 

Two further questions remain to be examined briefly. Firstly, what 
is the best method of finding a preliminary estimate of ρ? For example, 
with six observations, the method of grouping in three groups gives 

\[
r^2 = \frac{y_5 + y_4 - y_3 - y_2}{y_3 + y_2 - y_1 - y_0}
\]

Another method, which was used in the example of section (3), is to take 

\[
r = \frac{y_5 - y_1}{y_4 - y_0}
\]


As it is advantageous to start with a fairly good preliminary estimate, 
the variances of the different methods might be compared. We are 
however of the opinion that the best practical method is to draw a free- 
hand graph through the points and to determine r from three conven- 
iently chosen points on the graph. 
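
For comparison, the two arithmetical preliminary estimates can be evaluated on the six thermometer readings of section 3. This Python sketch assumes the three-group formula is the ratio of successive differences of the pair sums, giving r^2, and that the second method is (y_5 - y_1)/(y_4 - y_0).

```python
import math

y = [57.5, 45.7, 38.7, 35.3, 33.1, 32.2]   # readings from section 3

# Grouping in three groups of two: the ratio estimates r squared.
r_grouped = math.sqrt((y[5] + y[4] - y[3] - y[2])
                      / (y[3] + y[2] - y[1] - y[0]))

# Method used in the example of section 3.
r_spaced = (y[5] - y[1]) / (y[4] - y[0])
print(round(r_grouped, 4), round(r_spaced, 4))
```

Both values lie close to the final estimate 0.55187 obtained in that example, so either would serve as a starting point there.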
Secondly, we may ask if the tables given in this paper are of any help 
when we have more than seven equally spaced ordinates. If the number 
of observations is a multiple of 5, 6, or 7, we may employ the method of 
grouping already described. There will be some loss of information but 
not as much as in the extreme case of only three groups. The efficiencies 
for the case of six groups have been given above. With more groups, the 
efficiencies steadily rise. 

This suggests that, if the tabulation could be extended to, say, 
n = 12, most problems encountered in practice could be solved without 
appreciable loss of information. That we have not provided more 
extensive tables is due to the heaviness of the computational work 
required and to the hope that the range provided here will be of immedi- 
ate utility in agronomic and biological experimentation. 

In conclusion, it may be noted that the methods described in this 
paper may be generalised to deal with any regression of the type 

\[
y = \alpha + \beta f(\rho, x)
\]

where f(ρ, x) is any function of a parameter, ρ, and the independent 
variable. 
TESTING THE SOLUBILITY OF MAXIMUM LIKELIHOOD 
EQUATIONS IN THE ROUTINE APPLICATION 
OF SCORING METHODS 


Norman T. J. Bailey 
Department of Human Ecology, University of Cambridge 

1. GENERAL CONSIDERATIONS 


It is a common experience that the use of the method of maximum 
likelihood for estimating unknown parameters often leads to likelihood 
equations for which explicit solutions are absent or unknown. The 
standard procedure, first developed by Fisher (e.g. 1946) and further 
discussed by Rao (1948), is then to use an iterative process based on the 
calculation of scores and information functions. Suppose L is the 
logarithm of the likelihood, and that there are s parameters to be 
estimated, $\theta_i$, $i = 1, \ldots, s$. Scores, $S_i$, are defined by 

\[
S_i = \frac{\partial L}{\partial \theta_i}. \tag{1}
\]
The information matrix is given by the quantities, $I_{ij}$, where 

\[
I_{ij} = E\left(-\frac{\partial^2 L}{\partial\theta_i\,\partial\theta_j}\right). \tag{2}
\]
The required maximum likelihood estimates are $\hat\theta_i$, the solutions of the 
s equations 

\[
S_i = 0. \tag{3}
\]

If $\theta_i'$ represents a set of approximate solutions, then a more accurate set 
is given by 

\[
\theta_i'' = \theta_i' + \sum_{j=1}^{s} I^{ij} S_j, \tag{4}
\]
where $\{I^{ij}\}$ is the inverse of the information matrix. This process may 
be repeated, calculating new scores $S_i$ for the new set of estimates $\theta_i''$. 
If the corrections are small it is in practice usually unnecessary to 
recalculate the information matrix, as this changes relatively slowly. 
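
The iteration can be illustrated for a single parameter with a short Python sketch. Everything specific in it is an assumption made for illustration and is not taken from this note: the four-class model with expectations proportional to (2 + θ, 1 - θ, 1 - θ, θ), the counts, and the function names.

```python
def score_and_information(theta, counts):
    """Score and expected information for class expectations
    n(2+theta)/4, n(1-theta)/4, n(1-theta)/4, n*theta/4 (assumed model)."""
    n = sum(counts)
    a1, a2, a3, a4 = counts
    score = a1 / (2 + theta) - (a2 + a3) / (1 - theta) + a4 / theta
    information = (n / 4) * (1 / (2 + theta) + 2 / (1 - theta) + 1 / theta)
    return score, information


def scoring(theta, counts, tol=1e-10, max_iter=50):
    """Repeat theta <- theta + score/information until the step is tiny."""
    for _ in range(max_iter):
        score, information = score_and_information(theta, counts)
        step = score / information
        theta += step
        if abs(step) < tol:
            break
    return theta


counts = [125, 18, 20, 34]        # illustrative counts, not from this note
theta_hat = scoring(0.5, counts)
print(round(theta_hat, 4))
```

Here the information is recalculated at every step for simplicity; as remarked above, it could usually be held fixed once the corrections are small.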
In general, the s equations in s unknowns, given by (3) will be 
uniquely soluble. However, if the observations are derived from a 
multinomial distribution, as is frequently the case in genetics for 
example, the scores are linear functions of the observations. It is 
shown below that, unless there are at least s functionally independent 
expectations, the scores will be linearly dependent and unique solution 
of the likelihood equations impossible. 

While a professional statistician may be expected to have a proper 
appreciation of such a situation when it arises, it is possible that the 
routine application of the method of scoring by research workers in 
other fields may sometimes give rise to difficulties. It is not always 
obvious when the scores appropriate to some chosen mathematical model 
are linearly dependent; neither are functional relationships between the 
expectations necessarily immediately apparent. In such a case, how- 
ever, the matrix of expected amounts of information is singular and has 
no inverse; the matrix of observed amounts of information, on the other 
hand, will not be singular in general but should lead to very unlikely 
values of standard errors. Thus an indication that the mathematical 
model used is inadequate will usually be given by the attempted inver- 
sion of the information matrix. Nevertheless, not only is it possible that 
anomalous standard errors might be overlooked, but it is most desirable 
that the failure of the model used should be discovered before any actual 
computations are started. 

The present note gives a convenient test for the solubility of the 
maximum likelihood equations. It also demonstrates the somewhat 
obvious but extremely useful result that if the number of degrees of 
freedom is exactly equal to the number of parameters to be estimated, 
then, subject to the condition of solubility, the maximum likelihood 
estimates of the parameters can be obtained from the equations given by 
setting the observations equal to their expectations. The latter equa- 
tions are often simpler to solve. Two practical illustrations are given. 


2. THEORETICAL DISCUSSION 


Suppose that n observations fall into k classes containing $a_1, a_2, \ldots, a_k$ 
individuals, whose expectations are $m_1, m_2, \ldots, m_k$. Thus we have 
the restrictions 

\[
\sum_{r=1}^{k} a_r = \sum_{r=1}^{k} m_r = n. \tag{5}
\]

If the expectations are functions of the s parameters $\theta_i$ then the s 
maximum likelihood equations (3) may be written 

\[
S_i = \sum_{r=1}^{k} a_r \frac{\partial}{\partial\theta_i}(\log m_r)
    = \sum_{r=1}^{k} \frac{a_r}{m_r}\frac{\partial m_r}{\partial\theta_i} = 0. \tag{6}
\]
If the likelihood equations are to be independent then the rank of the 
s × k matrix $\{\partial(\log m_r)/\partial\theta_i\}$, which is a Jacobian, must be not less 
than s. This implies that at least s of the $m_r$ must be functionally 
independent, for the number of functionally independent expectations is 
equal to the rank of this matrix (e.g. Turnbull, 1945, pp. 126-7). We also 
have 

\[
I_{ij} = \sum_{r=1}^{k} \frac{1}{m_r}\frac{\partial m_r}{\partial\theta_i}\frac{\partial m_r}{\partial\theta_j}. \tag{7}
\]
It is easy to see from (7) that if $\{\partial(\log m_r)/\partial\theta_i\}$ has rank less than s, 
then $\{I_{ij}\}$ also has rank less than s, i.e. the expected information matrix 
is singular and has no inverse. 

On account of the restriction (5) not more than (k - 1) expectations 
can be independent, and so we cannot hope to estimate more than 
(k - 1) parameters. Suppose now that we have just this number of 
parameters, i.e. s = k - 1. Consider the k equations 

\[
m_r = a_r. \tag{8}
\]

Now if the Jacobian $\{\partial(\log m_r)/\partial\theta_i\}$ has rank s, not only are the 
maximum likelihood equations (6) uniquely soluble, but so are the equations 
(8). Moreover equations (6) are easily seen to be satisfied on 
substituting $m_r$ for $a_r$. Thus (6) and (8) are equivalent. 


3. ILLUSTRATIVE EXAMPLES 


Although it is easy to construct simpler illustrations of the foregoing 
principles, the following examples have been chosen because they actu- 
ally arose in practice. 


Example 1. 


Let us consider the problem of estimating genetic linkage from 
backcross matings, all of coupling phase, when both pairs of factors involved 
are affected by differential viability. This problem was considered by 
Bailey (1949). Briefly, we have matings of type AB/ab × ab/ab, whose 
offspring fall into four classes represented by AB, Ab, aB and ab. In the 
absence of viability disturbance the expected numbers in these classes 
are in the ratio q : p : p : q, where p = (1 - q) is the recombination 
fraction to be estimated. If the viabilities of A with respect to a, and of 
B with respect to b, are u and v respectively, then assuming the viability 
effects to operate independently the actual expectations will be as in 
Table I. 

TABLE I. 
Phenotype    AB        Ab       aB       ab      Total 
Expected     nuvq/R    nup/R    nvp/R    nq/R    n 
Observed     a         b        c        d       n 

where R = (uv + 1)q + (u + v)p. (9) 


The expectations of the observations a, b, c and d may be written as 
$m_1$, $m_2$, $m_3$ and $m_4$ respectively. Bailey obtained the equations 

\[
a = m_1, \qquad b = m_2, \qquad d = m_4, \tag{10}
\]

by manipulating the maximum likelihood equations, but as there are 
three degrees of freedom and three disposable parameters the result 
could have been written down immediately. It is easy to check the 
functional independence of three of the expectations by showing that 
the rank of the appropriate Jacobian matrix is three. 
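
Setting the observations equal to their expectations can be carried through explicitly for Example 1. The Python sketch below is an illustration under stated assumptions: it takes the expectations to be proportional to uvq, up, vp and q, from which the ratios of the counts give q/p, u and v directly, and it uses hypothetical counts. The function names are not from the paper.

```python
import math


def solve_example1(a, b, c, d):
    """Solve observed = expected for p, u and v (assumed closed form)."""
    q_over_p = math.sqrt(a * d / (b * c))   # (uvq * q) / (up * vp)
    p = 1 / (1 + q_over_p)
    u = math.sqrt(a * b / (c * d))          # (uvq * up) / (vp * q)
    v = math.sqrt(a * c / (b * d))          # (uvq * vp) / (up * q)
    return p, u, v


def expectations(n, p, u, v):
    """Expected numbers in classes AB, Ab, aB, ab."""
    q = 1 - p
    R = (u * v + 1) * q + (u + v) * p
    return [n * u * v * q / R, n * u * p / R, n * v * p / R, n * q / R]


observed = [40, 10, 15, 35]                 # hypothetical counts a, b, c, d
p, u, v = solve_example1(*observed)
fitted = expectations(sum(observed), p, u, v)
print(round(p, 4), round(u, 4), round(v, 4), fitted)
```

The fitted expectations reproduce the counts, which is what the equivalence of the two sets of equations leads one to expect when the number of parameters equals the degrees of freedom.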


Example 2. 


Difficulties of the type discussed in this paper often arise in problems 
of bacterial genetics. The practical technique of obtaining recombinant 
types involves selecting only organisms for which recombination has 
occurred over certain segments. Thus in matings of type Ab × aB we 
may be able to select progeny of type AB only; recombination over 
this segment is then said to be compulsory. Suppose now that we have 
two further loci within this segment, with alternatives x and x' at the 
first locus, and y and y' at the second. A suitable four-point cross is of 
type 


Axyb × ax'y'B 


All the progeny obtained must be of type AB, but are x'y', xy', xy or x'y 
according as there is recombination over the first, second, or third 
sub-segments or over all three, respectively. As in Example 1 the four 
types of scorable progeny may be affected by differential viability. A 
convenient way of coping with the difficulty is to run two types of cross, 


1. Axyb × ax'y'B 
2. Ax'y'b × axyB                                   (11) 


Bailey (1951) has discussed this problem and gave a test of significance 
for the type of differential viability postulated in Example 1. He stated 
that the obvious method of assigning the six available degrees of freedom 
to the three recombination fractions and three differential viabilities (as 
between four types of offspring) failed to give soluble maximum 
likelihood equations. The reasons for this follow from the treatment now 
under discussion. 

Let the recombination fractions for the three sub-segments be 
λ, μ and ν; and let the viabilities of x'y', xy' and xy, relative to x'y, be 
u, v and w respectively. The observations and expectations are then as 
shown in Table II. 


TABLE II. 

1. Axyb × ax'y'B 

Rec. over   Phenotype   Expected                                 Observed 
1           x'y'        m_1 = n_1 u λ(1 - μ)(1 - ν)/R_1          a 
2           xy'         m_2 = n_1 v (1 - λ)μ(1 - ν)/R_1          b 
3           xy          m_3 = n_1 w (1 - λ)(1 - μ)ν/R_1          c 
123         x'y         m_4 = n_1 λμν/R_1                        d 
Total                   n_1                                      n_1 

2. Ax'y'b × axyB 

Rec. over   Phenotype   Expected                                 Observed 
1           xy          m_5 = n_2 w λ(1 - μ)(1 - ν)/R_2          e 
2           x'y         m_6 = n_2 (1 - λ)μ(1 - ν)/R_2            f 
3           x'y'        m_7 = n_2 u (1 - λ)(1 - μ)ν/R_2          g 
123         xy'         m_8 = n_2 v λμν/R_2                      h 
Total                   n_2                                      n_2 

where 
\[
\begin{aligned}
R_1 &= u\lambda(1-\mu)(1-\nu) + v(1-\lambda)\mu(1-\nu) + w(1-\lambda)(1-\mu)\nu + \lambda\mu\nu \\
\text{and}\quad
R_2 &= w\lambda(1-\mu)(1-\nu) + (1-\lambda)\mu(1-\nu) + u(1-\lambda)(1-\mu)\nu + v\lambda\mu\nu.
\end{aligned} \tag{12}
\]
The Jacobian matrix is somewhat more complex this time, but it can 
be readily simplified by various obvious manipulations of rows and 
columns, and it is not difficult to show that its rank is five. It is worth 
remarking that a useful short cut can often be made at this stage by 
inserting in the Jacobian matrix the values of the parameters in the 
absence of linkage and viability effects, i.e. p = 1/2, u = 1, etc. It is 
usually very much easier to find the rank of the resulting matrix with 
simple arithmetical quantities than of the original matrix with the 
algebraic elements. 
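
The short cut lends itself to a numerical sketch. The Python code below is an illustration only: it assumes the expectations of Table II are proportional to uλ(1-μ)(1-ν), v(1-λ)μ(1-ν), w(1-λ)(1-μ)ν, λμν for the first cross and wλ(1-μ)(1-ν), (1-λ)μ(1-ν), u(1-λ)(1-μ)ν, vλμν for the second, approximates the Jacobian of the log expectations by central differences at λ = μ = ν = 1/2, u = v = w = 1, and finds its rank by elimination. All function names are hypothetical.

```python
import math


def expectations(theta, n1=1.0, n2=1.0):
    """Expected numbers m_1..m_8 for theta = (lam, mu, nu, u, v, w)."""
    lam, mu, nu, u, v, w = theta
    only1 = lam * (1 - mu) * (1 - nu)       # recombination over sub-segment 1
    only2 = (1 - lam) * mu * (1 - nu)       # over sub-segment 2
    only3 = (1 - lam) * (1 - mu) * nu       # over sub-segment 3
    all3 = lam * mu * nu                    # over all three
    cross1 = [u * only1, v * only2, w * only3, all3]
    cross2 = [w * only1, only2, u * only3, v * all3]
    return ([n1 * t / sum(cross1) for t in cross1]
            + [n2 * t / sum(cross2) for t in cross2])


def log_jacobian(theta, h=1e-6):
    """s x k matrix of d(log m_r)/d(theta_i) by central differences."""
    rows = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        rows.append([(math.log(p) - math.log(q)) / (2 * h)
                     for p, q in zip(expectations(up), expectations(down))])
    return rows


def matrix_rank(matrix, tol=1e-6):
    """Rank by Gaussian elimination with a tolerance for numerical noise."""
    m = [list(row) for row in matrix]
    rank = 0
    for col in range(len(m[0])):
        if rank == len(m):
            break
        pivot = max(range(rank, len(m)), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            factor = m[i][col] / m[rank][col]
            m[i] = [x - factor * y for x, y in zip(m[i], m[rank])]
        rank += 1
    return rank


null_point = [0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
jacobian_rank = matrix_rank(log_jacobian(null_point))
print(jacobian_rank)
```

A rank smaller than the number of parameters signals that the likelihood equations for all six parameters cannot be uniquely soluble. A rank found at one special point is only a lower bound on the general rank, so a full rank there settles the matter, whereas a deficient one calls for the algebraic check.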

In the present example there are evidently only five independent 
expectations. The requirements of a fixed total for each mating type 
give two functional relationships, 

\[
m_1 + m_2 + m_3 + m_4 = n_1, \tag{13}
\]
and 
\[
m_5 + m_6 + m_7 + m_8 = n_2. \tag{14}
\]
A third, easily obtained from the inspection of Table II, is 

\[
m_1 m_3 m_6 m_8 = m_2 m_4 m_5 m_7. \tag{15}
\]

We could evidently choose $m_1$, $m_2$, $m_3$, $m_5$ and $m_7$ as an independent set. 
If we can spot a sufficient number of independent functional 
relationships we shall be able to avoid examining the Jacobian, as we could in 
the present example where there are evidently not more than five 
independent expectations, but we have six disposable parameters. 

The practical consequences of the failure of the mathematical model 
are either that we must reduce the number of disposable parameters 
to a point where a unique solution is possible, or that we must increase 
the complexity of the experimental design so as to give a sufficient 
number of functionally independent expectations. In the above example 
we can simplify the model by postulating only two differential viabilities 
(of x' with respect to x, and of y' with respect to y), these acting 
independently. We can then estimate the five parameters satisfactorily. 
Alternatively, we can retain three differential viabilities, but extend the 
design to include at least one of the further mating types 


3. Ax'yb × axy'B 
4. Axy'b × ax'yB 


The scores for the original six parameters will then be found to be 
linearly independent. 
Corrigenda: “Rectangular lattices and partially balanced incomplete block designs” 
by K. R. Nair, Biometrics 7: 145-154, 1951. 
(1) Page 147, the last figure in the second row of p?;;, should be 1 instead of 0. 
(2) Page 147, the last figure in the second row of p*;, should be 0 instead of 1. 
ON CERTAIN ESTIMATION PROBLEMS IN CASES 
OF DOUBLE SAMPLING 


M. Lamotte and M. Schützenberger 


Laboratoire de Génétique, Faculté des Sciences, Paris 
and Centre de Génétique, Hôpital St. Louis, Paris 


I. INTRODUCTION 


When a species that is polymorphic for a certain character is distributed in 
more or less isolated colonies, the frequency of the character varies from one 
colony to another. It is of interest to determine the form of the distribution 
of these frequencies, but this determination poses a complex statistical 
problem. Only a limited number of colonies can in fact be studied, and each 
of them is known only imperfectly, through the analysis of a sample drawn 
from its total population. There are therefore two successive samplings: one 
at the level of each colony, the other at the level of the whole set of 
colonies of the species. 

Such a problem of double sampling also arises in fields other than the study 
of natural populations. It is met in particular in physiology and in 
experimental psychology, when each individual of a group is characterised by 
the probability that it will respond in a certain way to a given stimulus. To 
study the group, one can obviously analyse the behaviour of only a limited 
number of individuals, and each of them is subjected to only a finite number 
of tests. 

Other examples could be given, and it is only for reasons of convenience that 
we shall use here the language appropriate to the case of the distribution of 
a qualitative character in the colonies of a species. 

We shall make the following assumptions: 

(a) the size N of the colonies is the same for all; 

(b) the size n of the sample drawn from each of them is likewise the same 
everywhere. 
II. THE SAMPLING DISTRIBUTION 


Let f(p) be the probability function of the frequency p of the character over 
the whole set of colonies, and let f*(r) be the distribution law of the 
frequency r of the character in the ν samples. Whereas f(p) is defined for 
all the values 0, 1/N, 2/N, ..., i/N, ..., (N - 1)/N, 1 (or for every value 
of p between 0 and 1 if N can be regarded as infinite), f*(r) is defined 
only for the values 0, 1/n, 2/n, ..., k/n, ..., (n - 1)/n, 1; f*(r) is 
therefore always grouped in wider classes than f(p). 

Let g(r, p) be the probability of observing a frequency r (= k/n) when the 
sample is drawn from a colony in which the frequency is p = i/N. The theorem 
of compound probabilities allows us to write: 

\[
f^*(k/n) = \sum_{i=0}^{N} g(k/n,\, i/N)\, f(i/N) \tag{II, 1a}
\]
or 
\[
f^*(k/n) = \int_0^1 g(k/n,\, p)\, f(p)\, dp \tag{II, 1b}
\]

according as N is finite or infinite. 

In practice, only the following cases need be considered for the law g(r, p): 

Hypergeometric case, if N is not very large compared with n: 

\[
g(k/n,\, i/N) = \binom{i}{k}\binom{N-i}{n-k} \Big/ \binom{N}{n}
\]

Binomial case, if N is very large compared with n: 

\[
g(k/n,\, p) = \binom{n}{k} p^k (1-p)^{n-k}
\]

It is interesting to note that the values of f*(0) and f*(1) are always at 
least equal to the respective values f(0) and f(1), since we always have 
g(0, 0) = 1 and g(1, 1) = 1, whatever the probability law g(r, p). We have: 

\[
f^*(0) = f(0) + \sum_{i=1}^{N-1} g(0,\, i/N)\, f(i/N)
\]

It will also be noted that, if f(p) is symmetric about the value 1/2, the 
same is true of f*(r). 
Characteristic function and moments of the sampling distribution. 
The characteristic function of the distribution f*(r) is defined by: 

\[
\varphi^*(t) = \sum_{k=0}^{n} e^{it(k/n)} f^*(k/n)
\]

We therefore have: 

\[
\varphi^*(t) = \sum_{k=0}^{n} e^{it(k/n)} \sum_{i=0}^{N} g(k/n,\, i/N) f(i/N)
 = \sum_{i=0}^{N} f(i/N) \sum_{k=0}^{n} e^{it(k/n)} g(k/n,\, i/N)
 = \sum_{i=0}^{N} \psi_{i/N}(t) f(i/N), \tag{II, 3}
\]
that is to say, the characteristic function of f*(r) is equal to the mean, 
taken over all the colonies, of the characteristic function $\psi_{i/N}$ attached to 
the law of the second sampling. 
From this general relation between the characteristic functions $\varphi^*(t)$ and 
$\psi(t)$ one easily deduces the expression of the moments about zero of 
f*(r) in terms of the same moments of f(p): 


Hypergeometric case: 

\[
\begin{aligned}
\mu_1^* &= \mu_1 \\
\mu_2^* &= \frac{N(n-1)\mu_2 + (N-n)\mu_1}{n(N-1)} \\
\mu_3^* &= \frac{N^2(n-1)(n-2)\mu_3 + 3N(n-1)(N-n)\mu_2 + (N-n)(N-2n)\mu_1}{n^2(N-1)(N-2)} \\
\mu_4^* &= \frac{N^3(n-1)(n-2)(n-3)\mu_4 + 6N^2(n-1)(n-2)(N-n)\mu_3}{n^3(N-1)(N-2)(N-3)} \\
 &\quad + \frac{N(N-n)(n-1)[7(N-2n) + 3n + 1]\mu_2}{n^3(N-1)(N-2)(N-3)} \\
 &\quad + \frac{(N-n)[(N-2n)(N-3n) - N(n-1)]\mu_1}{n^3(N-1)(N-2)(N-3)}
\end{aligned} \tag{II, 4a}
\]

Binomial case: 

\[
\begin{aligned}
\mu_1^* &= \mu_1 \\
\mu_2^* &= \frac{(n-1)\mu_2 + \mu_1}{n} \\
\mu_3^* &= \frac{(n-1)(n-2)\mu_3 + 3(n-1)\mu_2 + \mu_1}{n^2} \\
\mu_4^* &= \frac{(n-1)(n-2)(n-3)\mu_4 + 6(n-1)(n-2)\mu_3 + 7(n-1)\mu_2 + \mu_1}{n^3} \\
\mu_h^* &= \frac{1}{n^h}\sum_{j=1}^{h} \mathfrak{S}_h^{j}\, \frac{n!}{(n-j)!}\, \mu_j
\end{aligned} \tag{II, 4b}
\]

where the $\mathfrak{S}_h^{j}$ are the Stirling numbers of the second kind. 


For h ≤ n, formulas (II, 4a) and (II, 4b) give the values of $\mu_h^*$ 
in terms of the moments $\mu_j$ of order j ≤ h. For h > n, they give 
the values of $\mu_h^*$ in terms of the moments $\mu_j$ of order j ≤ n (with n < h) 
only. Indeed, only the first n moments of f*(r) are independent, and its 
moments of order higher than n can be expressed in terms of the first n, 
since f*(r) takes only n + 1 values. 

Conversely, one can also obtain the expression of the moments about 
zero of the law f(p) in terms of the same moments of f*(r): 


Hypergeometric case: 

\[
\begin{aligned}
\mu_1 &= \mu_1^* \\
\mu_2 &= \frac{n(N-1)\mu_2^* - (N-n)\mu_1^*}{N(n-1)} \\
\mu_3 &= \frac{n^2(N-1)(N-2)\mu_3^* - 3n(N-1)(N-n)\mu_2^* + (N-n)(2N-n)\mu_1^*}{N^2(n-1)(n-2)} \\
\mu_4 &= \frac{n^3(N-1)(N-2)(N-3)\mu_4^* - 6n^2(N-1)(N-2)(N-n)\mu_3^*}{N^3(n-1)(n-2)(n-3)} \\
 &\quad + \frac{n(N-1)(N-n)(11N-7n-1)\mu_2^*}{N^3(n-1)(n-2)(n-3)} \\
 &\quad - \frac{(N-n)[(2N-n)(3N-n) - (N-n) - (n-1)N]\mu_1^*}{N^3(n-1)(n-2)(n-3)}
\end{aligned} \tag{II, 5a}
\]

Binomial case: 

\[
\begin{aligned}
\mu_1 &= \mu_1^* \\
\mu_2 &= \frac{n\mu_2^* - \mu_1^*}{n-1} \\
\mu_3 &= \frac{n^2\mu_3^* - 3n\mu_2^* + 2\mu_1^*}{(n-1)(n-2)} \\
\mu_4 &= \frac{n^3\mu_4^* - 6n^2\mu_3^* + 11n\mu_2^* - 6\mu_1^*}{(n-1)(n-2)(n-3)} \\
\mu_h &= \frac{(n-h)!}{(n-1)!}\sum_{j=1}^{h} S_h^{j}\, n^{j-1}\mu_j^*
\end{aligned} \tag{II, 5b}
\]

where the $S_h^{j}$ are the Stirling numbers of the first kind. 

Formulas (II, 5a) and (II, 5b) give the values of the $\mu_h$ in terms 
of the $\mu_j^*$ of order j ≤ h when h ≤ n. For h > n, these formulas become 
illusory: knowledge of the distribution f*(r) does not allow the complete 
sequence of moments of f(p) to be obtained, but only those of order at 
most equal to n. 
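
The passage from the moments of f*(r) back to those of f(p) can be checked numerically in the binomial case. The Python sketch below is an illustration under stated assumptions: the distribution of colony frequencies is hypothetical, and the inversion coefficients (1; n, -1; n², -3n, 2; n³, -6n², 11n, -6) are those of the binomial case as reconstructed here.

```python
from math import comb


def sample_moments(colony_dist, n, max_order):
    """Moments about zero of r = k/n, where k is binomial(n, p) and
    p takes the values in colony_dist with the given weights."""
    f_star = [sum(w * comb(n, k) * p ** k * (1 - p) ** (n - k)
                  for p, w in colony_dist)
              for k in range(n + 1)]
    return [sum((k / n) ** h * f_star[k] for k in range(n + 1))
            for h in range(1, max_order + 1)]


def invert_binomial(mu_star, n):
    """Recover the first four moments of f(p) from those of f*(r)."""
    m1, m2, m3, m4 = mu_star
    mu1 = m1
    mu2 = (n * m2 - m1) / (n - 1)
    mu3 = (n ** 2 * m3 - 3 * n * m2 + 2 * m1) / ((n - 1) * (n - 2))
    mu4 = ((n ** 3 * m4 - 6 * n ** 2 * m3 + 11 * n * m2 - 6 * m1)
           / ((n - 1) * (n - 2) * (n - 3)))
    return [mu1, mu2, mu3, mu4]


# Hypothetical colony distribution: (frequency p, probability weight).
colony_dist = [(0.2, 0.5), (0.5, 0.3), (0.9, 0.2)]
n = 6
true_moments = [sum(w * p ** h for p, w in colony_dist) for h in range(1, 5)]
recovered = invert_binomial(sample_moments(colony_dist, n, 4), n)
print(true_moments)
print(recovered)
```

The sample size n must exceed 3 for the fourth moment to be recoverable, in line with the remark that only moments of order at most n can be obtained.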


III. ESTIMATION OF THE CHARACTERISTICS OF THE DISTRIBUTION OF THE COLONIES 


Observation provides directly the values of f*(r) for the various 
values 0, 1/n, ..., (n - 1)/n, 1 of the frequencies in the ν samples 
of n individuals. One can then calculate the values $m_h^*$ of the moments, 
up to order n, of the experimental distribution. By definition, these 
$m_h^*$ are unbiased estimates of the $\mu_h^*$ and, consequently, 
the values $m_h$ obtained from the $m_h^*$ by the formulas (II, 5), which are linear, 
will also be unbiased estimates of the moments $\mu_h$ of the 
probability function. It will be noticed that these estimates are independent 
of the number ν of colonies surveyed. 

This would no longer be so for characteristics whose expression is not a 
linear function of the $\mu_h^*$. It may then be noted, however, that the 
formulas (II, 1) show that double sampling is equivalent to a simple 
sampling in which: 

(i) the law f(p) is replaced by the law f*(r), and 

(ii) each sample counts as only one observation. 

It will therefore be possible, using the classical formulas, to construct 
unbiased estimates of the characteristics of f(p). 

IV. FITTING A FREQUENCY CURVE TO EXPERIMENTAL RESULTS

To work back from the experimental results to the probability law
f(i/N), one has n + 1 values of f*(r), for r = 0, 1/n, ...,
(n − 1)/n and 1, from which one derives the moments m*_1, m*_2, ..., m*_n
and the estimates m_1, m_2, ..., m_n.

From these n + 1 quantities it is possible to fit a curve f(i/N) of a
form given a priori, whose expression contains fewer than n + 1
parameters.

If one seeks an approximation of f(i/N) by a sequence of polynomials f_h
of degree h (less than n), one will naturally impose, to determine the
polynomial f_h, the h + 1 conditions:

$$\sum_{i=0}^{N} (i/N)^j f_h(i/N) = m_j \quad \text{for } j = 0, 1, 2, \ldots, h$$

or, in the binomial case where N may be regarded as infinite:

$$\int_0^1 p^j f_h(p)\,dp = m_j \quad \text{for } j = 0, 1, 2, \ldots, h.$$

In this latter case, the use of Legendre polynomials, in which the usual
variable x is replaced by 2p − 1, will greatly simplify the calculations.
Indeed, to fit a curve of degree h, one takes:

$$f_h(p) = 1 + c_1 P_1 + c_2 P_2 + \cdots + c_h P_h$$

where the P_j are the Legendre polynomials in the transformed variable
p = (1 + x)/2 and the c_j are coefficients depending on the experimental
results. The following table gives the expressions of the polynomials P_j
and of the coefficients c_j in terms of the moments m*_j of the observed
distribution of samples, for j ≤ 6:

$$P_1 = 2p - 1$$
$$P_2 = 6p^2 - 6p + 1$$
$$P_3 = 20p^3 - 30p^2 + 12p - 1$$
$$P_4 = 70p^4 - 140p^3 + 90p^2 - 20p + 1$$
$$P_5 = 252p^5 - 630p^4 + 560p^3 - 210p^2 + 30p - 1$$
$$P_6 = 924p^6 - 2772p^5 + 3150p^4 - 1680p^3 + 420p^2 - 42p + 1$$

$$c_1/3 = 2m_1^* - 1$$
$$c_2/5 = (6n\,m_2^* - 6n\,m_1^*)(n-1)^{-1} + 1$$
$$c_3/7 = [20n^2 m_3^* - 30n^2 m_2^* + (12n^2 - 6n + 4)m_1^*]\,[(n-1)(n-2)]^{-1} - 1$$
$$c_4/9 = [70n^3 m_4^* - 140n^3 m_3^* + (90n^3 - 30n^2 + 50n)m_2^* - (20n^3 - 30n^2 + 50n)m_1^*]\,[(n-1)(n-2)(n-3)]^{-1} + 1$$
$$c_5/11 = [252n^4 m_5^* - 630n^4 m_4^* + (560n^4 - 140n^3 + 420n^2)m_3^* - (210n^4 - 210n^3 + 630n^2)m_2^* + (30n^4 - 90n^3 + 280n^2 - 100n + 48)m_1^*]\,[(n-1)(n-2)(n-3)(n-4)]^{-1} - 1$$
$$c_6/13 = [924n^5 m_6^* - 2772n^5 m_5^* + (3150n^5 - 630n^4 + 2940n^3)m_4^* - (1680n^5 - 1260n^4 + 5880n^3)m_3^* + (420n^5 - 840n^4 + 3990n^3 - 1050n^2 + 1176n)m_2^* - (42n^5 - 210n^4 + 1050n^3 - 1050n^2 + 1176n)m_1^*]\,[(n-1)(n-2)(n-3)(n-4)(n-5)]^{-1} + 1$$
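
The Legendre fit described in this section can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the moments passed in are the estimates m_j of the moments of f(p) (not the starred sample moments), and the coefficients are taken as c_j = (2j + 1) times the expectation of P_j, which agrees with the tabulated c_1/3 = 2m_1 − 1.

```python
from fractions import Fraction
from math import comb


def shifted_legendre_coeffs(j):
    """Coefficients a[k] of p**k in P_j(p), the Legendre polynomial in x = 2p - 1."""
    return [(-1) ** (j + k) * comb(j, k) * comb(j + k, k) for k in range(j + 1)]


def legendre_fit_coeffs(m, h):
    """Return [c_0, ..., c_h] with c_j = (2j + 1) * sum_k a[k] * m[k].

    m[k] is the estimated k-th moment of f(p), with m[0] = 1.
    """
    coeffs = []
    for j in range(h + 1):
        a = shifted_legendre_coeffs(j)
        expectation = sum(Fraction(a[k]) * m[k] for k in range(j + 1))
        coeffs.append((2 * j + 1) * expectation)
    return coeffs


def fitted_density(c, p):
    """Evaluate f_h(p) = sum_j c_j * P_j(p) at the point p."""
    total = Fraction(0)
    for j, c_j in enumerate(c):
        a = shifted_legendre_coeffs(j)
        total += c_j * sum(a[k] * Fraction(p) ** k for k in range(j + 1))
    return total
```

As a check, the moments 2/(k + 2) of the density f(p) = 2p give c_1 = 1 and all higher coefficients zero, so the fitted curve reproduces 2p exactly.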


V. SUMMARY


In studying a population divided into distinct groups, of which only a
limited number can be observed and each only through a sample, we have
been led to solve certain problems of double sampling. We have shown how
to obtain, from the results of observation, unbiased estimates of the
moments of the distribution law of the frequency of a qualitative
character in the population.

Knowledge of these estimates makes it possible to fit to the experimental
results a frequency curve whose expression contains a number of
parameters at most equal to the size of the samples.


VI. SUMMARY 


In a study of a population distributed in distinct groups of which 
only a limited number can be observed, each by a sample, it has been 
necessary to solve certain problems of double sampling. 

A demonstration has been made of how to obtain from the results of 
these observations unbiased estimates of the moments of the frequency 
function of a character in the population. 

Knowing these estimates, it is possible to fit to the experimental 
results a frequency curve the equation of which includes at the most a 
number of parameters equal to the size of a sample. 


THE CALCULATION OF χ² FOR AN r × c
CONTINGENCY TABLE


P. H. 


Bureau of Animal Population, Department of Zoological Field Studies, Oxford 


The following is a perfectly general method of calculating χ² for an
r × c contingency table, though it is especially useful when both r
and c > 2. Since the method avoids the direct calculation of the
expected frequencies, a considerable amount of time may be saved,
particularly if the contingency table is of a large order, or the values
of χ² are required as quickly as possible for a number of tables.

To take a simple 4 × 3 table as an example, suppose O_ij is the
observed frequency in the ith row and jth column, R_i and C_j the totals
for the individual rows and columns, and N the total number of
observations. As a first step, putting X_i = 1/R_i and Y_j = 1/C_j, we
rewrite the original table in the form

    O²_11   O²_12   O²_13   X_1
    O²_21   O²_22   O²_23   X_2
    O²_31   O²_32   O²_33   X_3
    O²_41   O²_42   O²_43   X_4
    Y_1     Y_2     Y_3

This can be done very quickly with the help of a table of squares and
reciprocals, such as Barlow's Tables of Squares, Cubes, etc., or Fisher &
Yates' Statistical Tables for Biological, Agricultural and Medical
Research, Tables XXVII and XXIX.

Then

$$\chi^2 = N\{Y_1(X_1 O_{11}^2 + X_2 O_{21}^2 + X_3 O_{31}^2 + X_4 O_{41}^2) + Y_2(X_1 O_{12}^2 + X_2 O_{22}^2 + X_3 O_{32}^2 + X_4 O_{42}^2) + Y_3(X_1 O_{13}^2 + X_2 O_{23}^2 + X_3 O_{33}^2 + X_4 O_{43}^2) - 1\}$$

$$= N\Bigl(\sum_{j=1}^{3} A_j - 1\Bigr), \text{ say.}$$


Thus, it may be seen by inspection that in order to obtain A_1 it is
only necessary to multiply each entry in the first column of the O²_ij
matrix by the appropriate entry in the column of reciprocals of the row
totals, accumulating the sum of the products on the machine, and then
multiply this result by the reciprocal at the foot of the column.
Similarly A_2 and A_3 are obtained from the second and third columns. The
sum of these quantities, less unity, is then multiplied by N to give χ².

The procedure has been illustrated here in terms of the columns of
the O²_ij matrix, but the whole computation could have been equally
carried out in terms of the rows. The latter would have been the
method of choice if there had been a greater number of columns than
rows in the example. More generally, if ρ be defined as a row vector
consisting of the r reciprocals of the row totals, and γ a column vector
formed by the c reciprocals of the column totals, Ω being the r × c matrix
of the squares of the observed frequencies, then

$$\chi^2 = N(\rho\,\Omega\,\gamma - 1);$$

and the calculation of the scalar ρΩγ can be carried out in any way which
seems the most convenient and rapid.
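
The identity can be checked numerically. The sketch below is only an illustration in modern terms (the function names are hypothetical, and the paper itself describes a hand-machine routine): it computes χ² from the squares and reciprocals, column by column, and compares it with the usual sum of squared deviations over expectations, using the brown rat frequencies of the numerical example. It assumes every row and column total is non-zero.

```python
def chi_square_by_reciprocals(observed):
    """Return (chi-square, [A_1, ..., A_c]) using N * (sum_j A_j - 1)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    a_values = []
    for j, c_j in enumerate(col_totals):
        # Squares of column j weighted by reciprocals of the row totals,
        # then multiplied by the reciprocal of the column total.
        weighted = sum(row[j] ** 2 / r_i for row, r_i in zip(observed, row_totals))
        a_values.append(weighted / c_j)
    return total * (sum(a_values) - 1.0), a_values


def chi_square_from_expected(observed):
    """Ordinary method: sum of (O - E)^2 / E with E = R_i * C_j / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for row, r_i in zip(observed, row_totals):
        for o_ij, c_j in zip(row, col_totals):
            e_ij = r_i * c_j / total
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    return chi2


rat_table = [
    [24, 7, 7],
    [76, 38, 70],
    [69, 32, 82],
    [27, 9, 55],
]
chi2, a_values = chi_square_by_reciprocals(rat_table)
print(round(chi2, 2))  # 24.93
```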

The following numerical example occurred in an analysis of some
data for the brown rat, Rattus norvegicus. This was only one out of a
number of such tables, and the frequencies represent the observed
number of pregnancies classified according to three stages of embryonic
development (S_j), and to four body weight classes of the pregnant
females (W_i).


            S1     S2     S3    Total
    W1      24      7      7       38
    W2      76     38     70      184
    W3      69     32     82      183
    W4      27      9     55       91
    Total  196     86    214      496


In rewriting the original table in the form suitable for calculation, a
fairly large number of decimal places should be retained in the reciprocals
of the row and column totals. Thus, in the present example, we have the
following table.


          576          49          49    0.0263158
         5776        1444        4900    0.0054348
         4761        1024        6724    0.0054645
          729          81        3025    0.0109890
    0.0051020   0.0116279   0.0046729

Then      A_1 = (0.0263158 × 576 + ... + 0.0109890 × 729) × 0.0051020
              = 0.411103
Similarly A_2 = 0.181664
          A_3 = 0.457500

Total           1.050267

and χ² (6 degrees of freedom) = 496 × 0.050267 = 24.93.


Starting at the time the original table was copied down on a computing
sheet, the whole calculation took 9 minutes, using a hand machine.
The whole procedure was repeated, this time the expected frequencies
(E_ij) being calculated in the usual way from the marginal totals, and χ²
computed by means of

$$\chi^2 = \sum\sum \frac{O_{ij}^2}{E_{ij}} - N.$$


This was found to take 14 minutes, so that there was a saving of 5
minutes through adopting the first method. Thus, given a number of
such 4 × 3 tables, the saving of time becomes appreciable. Moreover,
it seems likely that, as the order of the contingency table becomes
greater, the proportionate saving of time also increases. For example,
the ratio of times was 11 : 19 in the case of a 4 × 4 table for which χ²
was also calculated by the ordinary method of taking the deviations
from expectations, with χ² = ΣΣ (O_ij − E_ij)²/E_ij.

The disadvantage of the method lies, of course, in the fact that we
have not calculated the expected frequencies, so that in the event of an
excessive value of χ², we are unable to see immediately in which class
or classes the discrepancy is occurring. Nevertheless, we may still be
able to obtain some indication of where to start looking for the
departures from expectation, if these are not obvious by inspection, as
in the above example. Thus, using these figures merely as an illustration,
the contribution to the total χ² by the first column is given by
NA_1 − C_1 = 7.91, by the second column NA_2 − C_2 = 4.10, and similarly
by the third 12.92, suggesting that, if we were in doubt, it would
be worth while scrutinizing, as a start, the entries in the third column.
But, faced with data in the form of a contingency table, what we are
primarily interested in, as computers, is the value of χ². If the latter is
satisfactory, we usually do not wish to go any further, and therefore any
method of computation which saves time is, in the long run, an advantage.
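
The column contributions N·A_j − C_j quoted above can be reproduced directly. The following sketch uses a hypothetical helper name and the brown rat table of the numerical example; it is an illustration of the arithmetic, not part of the original procedure.

```python
def column_contributions(observed):
    """Contribution of each column to chi-square, computed as N * A_j - C_j."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    contributions = []
    for j, c_j in enumerate(col_totals):
        a_j = sum(row[j] ** 2 / r_i for row, r_i in zip(observed, row_totals)) / c_j
        contributions.append(total * a_j - c_j)
    return contributions


rat_table = [
    [24, 7, 7],
    [76, 38, 70],
    [69, 32, 82],
    [27, 9, 55],
]
contributions = column_contributions(rat_table)
print([round(x, 2) for x in contributions])
```

The contributions add up to the total χ², so the largest of them points to the column worth inspecting first.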


AN ANALYSIS OF RECTANGULAR LATTICES WITH UNEQUAL 
BLOCK SIZES, USING INTER-BLOCK INFORMATION 


PAMELA M. CLARKE


National Institute for Research in Dairying 
Shinfield, England* 


1. INTRODUCTION


The square lattice design (2) and the three-dimensional cubic lattice
design (7) are available for variety trials and experiments with large
numbers of treatments; both designs permit recovery of inter-block
information. Yates (6) has also given the design and analysis of a
rectangular lattice for pq varieties, using blocks of unequal sizes, but
without recovery of inter-block information. The present paper extends
the analysis to recover this information, using the assumption that the
inter-block and intra-block components of variance are little affected
by the unequal block sizes.

In practice p and q need seldom differ by more than one, and when
q = p + 1 the rectangular lattice design with blocks of equal size,
developed by Harshbarger (4), is preferable, using the method of
analysis given by Cochran and Cox (1), (cf. also Grundy, 3). The
unequal block design may, however, be useful in some cases where the
difference between p and q exceeds unity, but is neither greater than 5
nor greater than 10% of pq.


2. DESIGN 


Throughout the paper, the design is considered in terms of a variety
trial. In applying the method to other experiments, "treatments"
should be read for "varieties".

*Most of this paper was written while the author was on the staff of Rothamsted Experimental
Station.

The design is applicable to pq (p < q) varieties with an even number,
2n, of replications. The varieties are allotted at random to the pq
intersections of a p × q lattice with p horizontal lines and q vertical
lines, and each variety may then be denoted by two numbers u and v, where
the variety at the intersection of the ith horizontal line and jth
vertical line is given by u = i, v = j, as shown in Fig. 1.


              v = 1    2    3    4
    u = 1        11   12   13   14
    u = 2        21   22   23   24
    u = 3        31   32   33   34

    Fig. 1.


The experiment is then laid out on the ground so that one of each
pair of replications is made up of q blocks of p plots each, where the sets
of p varieties defined by the vertical lines v = constant are allotted to the
q blocks at random. Such replications will be referred to as the X group.
In the other replication, denoted by Y, of each pair there are p blocks of
q plots each, the sets of q varieties defined by the horizontal lines
u = constant again being allotted to the blocks at random. The
positions of the varieties within each block are randomized. The group of
blocks forming a complete replication should be arranged as compactly
as possible.


3. STATISTICAL ANALYSIS 


3.1. General method. 


The method of analysis is similar to that described in (2) for the 
square lattice, and when p is put equal to q the formulae given here
reduce to those for the square lattice. Any varietal comparison may be 
expressed in terms of mutually orthogonal components, some of which 
are independent of block differences. For the remainder of the com- 
parisons there are two available estimates, one from inter-block com- 
parisons and one from intra-block comparisons; the analysis affords an 
estimate of the relative accuracy of the two types of comparison, and a 
weighted mean of the separate estimates is obtained. 

It is evident that the two groups X and Y will have different
within-block and between-block components of variance, since in one group
each replication is made up of q blocks of p plots each and in the other
group each replication contains p blocks of q plots each. If, however,
the value of (q − p) is not large, and if the variability of the soil is not
very different for the two groups of replications, it is possible to ignore
these differences in variances and use both groups to give pooled
estimates of the components of variance. A brief investigation of the slight
bias introduced into the estimates of error variance by this assumption
of equality is made in section 4; the bias is likely to be small, and it
decreases as the number of replications is increased.


3.2. Notation and lay-out of computational tables. 


Individual plot yields are set out in the appropriate places in 2n
(p × q) tables on the scheme of Fig. 1. Similar tables are drawn up for
varietal totals over group X and over group Y, and for varietal totals
over the whole experiment.

The following notation is used, with r = 1, ..., n; i = 1, ..., p;
j = 1, ..., q. In group X the block totals are denoted by X_jr, the
replication totals by R_xr, and the total of group X by T_x. Similarly
Y_ir (block total), R_yr (replication total), and T_y (total) refer to group Y.
A typical column (block-type) total in group X is denoted by
V_xj (= Σ_r X_jr), and the total of the same varieties in group Y by V_yj;
a row (block-type) total in group Y is denoted by U_yi (= Σ_r Y_ir), and
the total of the same varieties in group X by U_xi. Also T_ij denotes the
total for variety u = i, v = j over all replications, and L_i = U_xi − U_yi,
M_j = V_yj − V_xj, T_L = ΣL_i = −ΣM_j.


3.3. Weighted estimates of means. 


The yield of any plot, apart from varietal effects, may be regarded as
the sum of two normally distributed independent components, one of
which varies from plot to plot, with variance σ², while the other is
constant for all plots in a block and varies from block to block with
variance σ'², assuming these variances to be unaffected by the different
block sizes in the two groups of replications.

In the same way as for the square lattice (1), we may obtain a
weighted estimate, T'_ij, of the varietal total T_ij, given by


Ti; = Ti; + — T1)/pa + ».(qM; + T1)/paq, (3.1) 
where 


A= (W- + (3.2) 
and 
1/W = σ², 1/W'_x = σ² + pσ'² and 1/W'_y = σ² + qσ'² (3.3)


The correction terms δ_i and ε_j, given by
6; = A(pL; — T1)/pa, e; = 4.(qM; — Tu)/pq (3.4)


are calculated using the two values of λ given by the analysis of
variance. They are set down in the margins of the table of unadjusted
varietal totals, and are added respectively to the totals T_ij to give the
adjusted varietal totals T'_ij, which are then divided by 2n to give the
adjusted varietal means. The relations Σδ_i = 0 and Σε_j = 0 may be
used as computational checks.


3.4. Analysis of variance. 


The form of the analysis of variance is as shown in Table 1 below.
The varietal effects are eliminated from the sums of squares for blocks
in order to estimate the two values of λ.


TABLE 1.

    Source of variation               Degrees of freedom               Mean square

    Replications                      2n − 1
    Blocks (eliminating varieties)
        Component (a)                 (n − 1)(p + q − 2)
        Component (b)                 p + q − 2
    Varieties                         pq − 1
    Intra-block error                 pq(2n − 1) − n(p + q) + 1        E

    Total                             2npq − 1
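
A quick consistency check on the degrees of freedom is sketched below. It is an assumption-laden illustration: the function name is hypothetical, and the entries for replications, component (b), varieties and the total are taken as 2n − 1, p + q − 2, pq − 1 and 2npq − 1, the values that make the partition add up and that agree with the n(p + q − 2) degrees of freedom quoted for the combined block mean square.

```python
def lattice_degrees_of_freedom(p, q, n):
    """Degrees of freedom for a p x q rectangular lattice in 2n replications."""
    df = {
        "replications": 2 * n - 1,
        "blocks_component_a": (n - 1) * (p + q - 2),
        "blocks_component_b": p + q - 2,
        "varieties": p * q - 1,
        "intra_block_error": p * q * (2 * n - 1) - n * (p + q) + 1,
    }
    df["total"] = 2 * n * p * q - 1
    return df


example = lattice_degrees_of_freedom(p=5, q=6, n=2)
components = sum(v for k, v in example.items() if k != "total")
print(example, components == example["total"])
```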


All the sums of squares, except those for blocks, are obtained in the
usual way.

Component (b) of the sum of squares for blocks is computed as

$$\{(\textstyle\sum L_i^2)/2nq\} + \{(\textstyle\sum M_j^2)/2np\} - \{2(T_L^2/2npq)\}$$

When n = 1, component (a) vanishes, and when n = 2, this sum of
squares is most easily computed by taking the sums of squares for
deviations of (X_j1 − X_j2) and (Y_i1 − Y_i2). In this case the values of
(X_j1 − X_j2) and (Y_i1 − Y_i2) are included in the computation tables,
and we have

$$\text{component (a)} = \{\textstyle\sum (X_{j1} - X_{j2})^2/2p\} + \{\textstyle\sum (Y_{i1} - Y_{i2})^2/2q\} - \{(R_{x1} - R_{x2})^2 + (R_{y1} - R_{y2})^2\}/2pq.$$


If n ≥ 3, it is necessary to draw up an auxiliary table of sums of
squares in order to compute component (a), as shown in Table 2. In this
case it is advisable to record the correction term (T_L²/2npq) for the
quantities L and M. The sum of squares for replications can then
conveniently be obtained as the sum of (T_L²/2npq) and the two items (ii)
of the auxiliary table.


TABLE 2. AUXILIARY TABLE OF SUMS OF SQUARES (n ≥ 3)


    Group X
                                  d.f.                S.S.
    (i)   Correction for mean                         T_x²/npq
    (ii)  Replications            n − 1               (Σ_r R_xr²)/pq − (i)
    (iii) Block-types             q − 1               (Σ_j V_xj²)/np − (i)
    (iv)  Total                   nq − 1              (Σ_j Σ_r X_jr²)/p − (i)
    Block component (a)           (n − 1)(q − 1)      (iv) − (ii) − (iii)

    Group Y
                                  d.f.                S.S.
    (i)   Correction for mean                         T_y²/npq
    (ii)  Replications            n − 1               (Σ_r R_yr²)/pq − (i)
    (iii) Block-types             p − 1               (Σ_i U_yi²)/nq − (i)
    (iv)  Total                   np − 1              (Σ_i Σ_r Y_ir²)/q − (i)
    Block component (a)           (n − 1)(p − 1)      (iv) − (ii) − (iii)


3.5. Estimation of the weights. 


Let C denote the mean square for block components (a) and (b)
combined, and E the intra-block error mean square. Then E is an estimate
of σ², and C is an estimate of σ² + (2n − 1)μσ'²/2n, based on
n(p + q − 2) degrees of freedom, where


_ 


Hence we have as estimates of W, W'_x and W'_y,

$$W = 1/E,$$
$$W'_x = (2n-1)\mu/\{[(2n-1)\mu - 2np]E + 2npC\}, \qquad (3.6)$$
$$W'_y = (2n-1)\mu/\{[(2n-1)\mu - 2nq]E + 2nqC\}.$$

The two values of λ may now be computed, using equation (3.2).
If C is less than E, W'_x and W'_y should be taken as equal to W, and
this leads to an analysis of complete randomized blocks of pq plots each.
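
The estimation of the weights can be sketched as follows. This is an illustration under stated assumptions, not the author's computing scheme: the function name and the labels `w_x`, `w_y` for the two block weights are hypothetical, and μ is passed in as a number because its defining equation (3.5) is not legible in this copy. The block variance is estimated from E and C by solving C = σ² + (2n − 1)μσ'²/2n.

```python
def estimate_weights(E, C, n, p, q, mu):
    """Estimate W = 1/sigma^2 and the two block weights from the mean squares.

    E  : intra-block error mean square (estimate of sigma^2)
    C  : mean square for block components (a) and (b) combined
    mu : the constant of equation (3.5), supplied by the caller
    """
    w = 1.0 / E
    if C < E:
        # Negative variance estimate: fall back to complete randomized blocks.
        return {"w": w, "w_x": w, "w_y": w, "block_variance": 0.0}
    block_variance = 2 * n * (C - E) / ((2 * n - 1) * mu)
    w_x = 1.0 / (E + p * block_variance)   # blocks of p plots (group X)
    w_y = 1.0 / (E + q * block_variance)   # blocks of q plots (group Y)
    return {"w": w, "w_x": w_x, "w_y": w_y, "block_variance": block_variance}


# Hypothetical figures chosen so that sigma^2 = 4 and the block variance is 1.5.
weights = estimate_weights(E=4.0, C=9.625, n=2, p=4, q=6, mu=5.0)
print(weights)
```

Writing the weights as reciprocals of E plus a multiple of the estimated block variance is algebraically the same as formula (3.6), and makes the role of the block size visible.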


3.6. Variances of the differences between adjusted varietal means. 


The variance of the difference between two adjusted varietal means 
will depend on whether the two varieties occur in the same block in the 
X group, in the same block in the Y group, or never in the same block. 
When (q − p) is not greater than 5% of pq, it is usually sufficient to pool
the variances in the three cases to give a standard error for use in making 
varietal comparisons. 

The variance in each of the three cases may be derived by expressing 
the difference between two adjusted mean yields in terms of orthogonal 
between-block and within-block comparisons, giving the following vari- 
ances: 


Two varieties in the same block of p plots (1 + A,/q)/nW 
Two varieties in the same block of q plots (1 + d./p)/nW 
Two varieties not in the same block (1 + A,/q + A./p)/nW 


Average 


4. BIAS IN ESTIMATE OF VARIANCE 


The method of analysis described uses the assumption that σ² and σ'²
are little affected by differences in block size. If the variances in the two
groups of replications are not equal, the estimate of the error variance
of a difference between two varietal means will be biased. An expression
for the percentage bias was obtained, and values calculated using
uniformity trial figures for mangolds and wheat (Mercer and Hall, 5). In
general the estimated error variance is an underestimate of the true
value. Where (q − p) was not large and was not more than 10% of pq,
the percentage bias obtained was extremely small: for one pair of
replications, with q − p < 5 and (q − p)/pq < 0.1, the maximum calculated
value was 0.95%, and in many cases the percentage bias was less than
0.2%. The bias decreased as the number of replications was increased.


5. RELATIVE EFFICIENCY OF THE DESIGN


It may be shown, as for a quasi-factorial design with equal blocks
(7), that the rectangular p × q lattice design may be analysed as if it
were an arrangement in ordinary complete randomized blocks of pq plots
each. The efficiency of the lattice design relative to an arrangement in
complete blocks of pq plots may be obtained from the ratio of the
variances of the difference between two treatment means for each design.
Table 3 below gives values of the percentage relative efficiency, with and
without recovery of inter-block information, for some values of p and q
and tabulated for various values of θ, where


— + — 2). (6.1) 


θ is equivalent to W/W', where W' = 1/(σ² + μσ'²) and μ is given by
equation (3.5).


TABLE 3. PERCENTAGE EFFICIENCY OF THE LATTICE DESIGN RELATIVE TO
RANDOMIZED COMPLETE BLOCKS. (THE UPPER AND LOWER FIGURES GIVE THE
RELATIVE EFFICIENCY WITH AND WITHOUT RECOVERY OF INTER-BLOCK
INFORMATION RESPECTIVELY.)


    p    q    pq    θ = 1      2       3       4       6       8      10
    5    6    30    100.0   104.7   113.5   123.6   145.4   168.1   191.2
                     76.3    88.2   100.0   111.8   135.5   159.2   189.2
    6    9    54    100.0   103.8   110.0   119.4   137.4   156.2   173.5
                     80.3    90.2   100.0   109.8   129.5   149.2   168.2
    7   10    70    100.0   103.4   109.9   117.4   133.7   150.7   168.1
                     82.1    91.1   100.0   108.9   126.8   144.6   162.5
    8   11    88    100.0   103.1   109.0   115.8   130.7   146.2   162.1
                     83.6    91.8   100.0   108.2   124.5   140.9   157.2


The values in Table 3 were calculated ignoring the slight bias dis- 
cussed in the previous section, and the small loss of information due to 
inaccuracy of weighting. 


6. SUMMARY 


The paper extends the analysis of the two-dimensional p X q lattice 
design, with blocks of unequal size, to recover inter-block information 
when (q — p) is neither greater than 5 nor greater than 10% of pq. 

The method of analysis may be usefully extended for the three- 
dimensional lattice with pqr varieties; a description of the appropriate 
analysis is in preparation. 

I am indebted to Dr. Yates for suggesting the subject of this paper, 
and to Dr. P. M. Grundy and Mr. C. P. Cox for their help, criticism and 
encouragement. 
SOME COMMENTS ON "ESTIMATES OF THE LD50:
A CRITIQUE"


JEROME CORNFIELD AND NATHAN MANTEL 


National Cancer Institute, Public Health Service, Federal Security Agency 


In the December 1950 issue of Biometrics [1] Bross presented a
comparative study of three methods of estimating the LD50: Kärber's
method, the Reed-Muench method and the method of maximum likelihood.
Dr. Bross' general conclusion on the comparative advantages of
the first and third of these methods is "The Cornfield-Mantel iterative
approximations to the maximum likelihood estimate do not improve the
accuracy of the Spearman-Kärber's method. The additional work seems
to be wasted insofar as the LD50 is concerned."

Dr. Bross' results and conclusions are based upon successive samples
drawn from known populations. The only source of practical interest in
the problem, however, arises when samples are drawn from unknown
populations. In most studies in experimental sampling, the conclusions
drawn for known populations can be applied without modification to
unknown populations. In the present case, because of the special
characteristics of Kärber's method, this unfortunately cannot be done. In
fact, the situation appears to be:


(a) In sampling from known populations there are situations in
which Kärber's method is definitely superior to the method of
maximum likelihood as shown by Bross' work.


(b) In sampling from unknown populations there are no situations
in which Kärber's method is superior, although there are some in
which it is just as good.

In deciding which method to apply to a set of laboratory data, conclusion
(b) rather than (a) is of course governing.

This situation arises naturally from the nature of Kärber's method.
To use it, it is necessary to assume that at the dose level next below the
lowest in the experiment, the true response is zero percent, while at the
dose level next above the highest the true response is 100 percent. Thus,
in sampling from a population with known true proportions at successive
dose levels of .10, .32, .68 and .90, Dr. Bross assumes for all samples a
response of zero percent for the unobserved next lower dose level and a
response of 100 percent for the unobserved next higher dose level. This
procedure introduces no error in the estimate of the LD50 but is the
equivalent of adding infinite sized samples at two extreme dose levels to
supplement the animals available to maximum likelihood. Kärber's
method consequently yields superior results in this case.

Suppose now we treat the same data as if the samples were drawn 
from an unknown population. With two animals at a dose level four 
situations may arise: 


(a) The lowest dose level yields zero percent response and the 
highest one-hundred percent. With two animals at a dose level, 
this will occur in about two-thirds of the samples. 


(b) The lowest dose level yields zero percent response, but the high- 
est yields a response of 50 percent or less. This will occur in 
about fifteen percent of the samples. 


(c) The highest dose level yields 100 percent response but the lowest 
yields a response of 50 percent or more. This will occur in about 
fifteen percent of the samples. 


(d) The lowest dose level yields a response of 50 percent or more; 
the highest 50 percent or less. This will occur in about four 
percent of the samples. 
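
The quoted frequencies of the four situations follow from the binomial distribution. The sketch below is an illustration with hypothetical function names; it assumes the animals at the lowest and highest dose levels respond independently, with the true proportions .10 and .90 given in the text and two animals at each level.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k responders among n animals."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def situation_probabilities(p_lowest, p_highest, n):
    """Probabilities of situations (a) to (d) for n animals per dose level."""
    low_zero = binomial_pmf(0, n, p_lowest)
    low_half_or_more = sum(
        binomial_pmf(k, n, p_lowest) for k in range(n + 1) if 2 * k >= n
    )
    high_all = binomial_pmf(n, n, p_highest)
    high_half_or_less = sum(
        binomial_pmf(k, n, p_highest) for k in range(n + 1) if 2 * k <= n
    )
    return {
        "a": low_zero * high_all,
        "b": low_zero * high_half_or_less,
        "c": high_all * low_half_or_more,
        "d": low_half_or_more * high_half_or_less,
    }


probs = situation_probabilities(p_lowest=0.10, p_highest=0.90, n=2)
print({k: round(v, 3) for k, v in probs.items()})
```

With two animals per level the four situations are exhaustive, so the probabilities add to one.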


It seems clear that if one did not know the nature of the population one
would at most use Kärber's method in situation (a) but not in the others.
Thus, if the lowest observed dose level yields a sample response of 50
percent or more, situation (c) or (d), one would be leaving the realm of
statistics and entering that of personal judgement (about which nothing
general can be said) if one assumed a zero percent response at the next
lower dose level.

It is therefore pertinent to ask how Kärber's method would compare
with that of maximum likelihood separately for these four situations.
Dr. Bross has not published his data in a form that can provide an answer
to this question, nor have we attempted to recompute his results. We,
nevertheless, believe that in situation (a) the margin of superiority of
Kärber's method is slight, while in (b) and (c) it is considerable. This
is a purely factual question, and if we are wrong, we hope Dr. Bross will
call it to our attention. It follows that the superiority of Kärber's
method, as shown in Dr. Bross' Graph I, arises almost entirely from
situations (b) and (c), in which one could not apply Kärber's method if
one did not know the nature of the population sampled.

Dr. Bross also considers an asymmetric situation with known true
proportions of .59, .86, .96 and .99, with both two and five at a dose level.
In this case there are very few situations in which one would assume zero
percent response at the next lower dose level if the population were
unknown. Thus, with five at a dose level, only one percent of the
samples would yield a lowest dose level with a zero percent response,
while over 30 percent would yield a lowest dose level with a sample
response of 80 percent or more. It is in this latter situation, however,
that Kärber's method gives the best results when sampling from a
known population. This is so because it compels the estimated ED50
to lie above the dose level to which zero percent response is imputed,
whereas any estimator which uses only sample information will, for this
situation, tend to estimate it below this dose level, and hence farther
away from the true value. As in the symmetric situation, therefore, the
superiority of Kärber's method arises almost entirely from those
situations in which one could not use it if one did not know the nature
of the population sampled.
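
The two percentages quoted for the asymmetric case can be verified in the same way. This is a small sketch with a hypothetical function name, using the true proportion .59 at the lowest dose level and five animals per level as stated in the text.

```python
from math import comb


def lowest_dose_extremes(p_lowest, n):
    """Return (P(no responders), P(at least 80 percent responders)) at one dose level."""
    pmf = [comb(n, k) * p_lowest ** k * (1 - p_lowest) ** (n - k) for k in range(n + 1)]
    p_zero = pmf[0]
    # 5 * k >= 4 * n is the integer form of k / n >= 0.8.
    p_eighty_or_more = sum(pmf[k] for k in range(n + 1) if 5 * k >= 4 * n)
    return p_zero, p_eighty_or_more


p_zero, p_eighty = lowest_dose_extremes(p_lowest=0.59, n=5)
print(round(p_zero, 4), round(p_eighty, 4))
```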

There is a complication that arises in the asymmetric case which
casts further light on the matter. In the symmetric case, Dr. Bross
assumes zero mortality at a dose level at which the true mortality is
.025. In the asymmetric case, zero mortality is assumed at a dose level
at which the true mortality is .26. In the latter case, therefore, the
assumption is in serious error and results, as Dr. Bross points out, in an
upward bias in Kärber's estimate of the ED50. Because the estimate
must lie above dose level 0, it also results in a lower sampling variance.
Dr. Bross' results with Kärber's method consequently reflect the net
effect of a large bias and a low variance. With two animals at a dose
level, the effect of lowering the variance predominates, and Kärber's
method gives uniformly superior results. With five at a dose level, the
effect of the increased bias begins to appear: the method of maximum
likelihood gives values within two-tenths of a dose level of the true value
more frequently. Because of its larger variance, it also gives values
within four-tenths of a dose level of the true value less frequently. It is
easy to show that in the limit, Kärber's method will overestimate the
ED50 by about two-fifths of a dose level in this situation. This
indicates that as sample size increases beyond five at a dose level, the bias
effect in Kärber's method becomes more and more important. We
would conjecture that with sample sizes of ten or fifteen at a dose level,
the maximum likelihood estimate will be superior for all values of δ less
than or equal to .5 of a dose level.¹ This indicates that even when
sampling from known populations, if the configuration of dose levels is as
asymmetric as .59, .86, .96 and .99, Kärber's method may not be superior
for the size of sample usually encountered in practice. This is a purely
academic observation, however, since we are concerned only with
samples from unknown populations.

We would summarize our observations in the following comment:
When sampling from unknown populations we always confine ourselves
to estimators which are in some sense good, no matter what the unknown
population value is. Kärber's estimate is not a consistent estimator when
the imputed values of zero and 100 percent are incorrect. In practice,
therefore, we use it only when the sample data indicate quite clearly that
the imputed values are correct, and in this case, we have no gain in
accuracy. To those who do not accept this limitation, we propose an
estimator which is superior to both maximum likelihood and Kärber's
method: the arithmetic average of the dose levels assumed to yield zero
and 100 percent response. It will have zero variance and zero bias for all
symmetric configurations of dose levels.


¹It can never be superior for all values of δ since Kärber's estimate must lie between the dose levels
assumed to yield zero and 100 percent response, whereas no such constraint exists for the maximum
likelihood estimate.


REFERENCES
[1]. Irwin Bross, Estimates of the LD50: A Critique. Biometrics, 6, 413, 1950.


QUERIES 


GEORGE W. SNEDECOR, Editor


QUERY 90: I have encountered a difficulty which I am unable to
explain. A randomized blocks experiment yielded the six treatment
means: A, 26.0; B, 22.8; C, 30.0; D, 20.2; E, 24.0; F, 30.4.

The analysis of variance gives the mean squares,


Treatments, 5 d.f., 82.11 
Error, 20 d.f., 26.48 


The value of F is at about the 4% level of significance. But if I test the
significance of the difference between the extreme means, 30.4 − 20.2 =
10.2, by using t, I get P less than 0.01. I have understood that these two
tests are the same. Can you tell me why the results are different?


ANSWER: The only circumstance in which the two tests are identical 
is that in which the number of treatments is 2. In your 
experiment the null hypothesis tested by F is that the 
population means are all the same. If this hypothesis is rejected, the 
alternative hypothesis accepted is that 2 or more of the population means 
are different. There is nothing in the accepted hypothesis that specifies 
which of the means are different or how much they differ. 

The t-test produces correct probabilities only if the 2 means are 
randomly selected. It cannot be applied to the difference between means 
which are chosen because their difference happens to be large. This 
matter is discussed by Fisher in “Design of Experiments”, section 24; 
by Snedecor in “Statistical Methods”, section 15.5; and in “Queries”, 
this journal volume 1:26(1945) and volume 5:250(1949). An empirical 
method for handling a set of means such as yours is discussed by Tukey 
in this journal, volume 5:99-114 (1949). 
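
The arithmetic behind the query, and the reason the t-test misleads when the two means are picked for being extreme, can be sketched as follows. This is an illustration only: the block count of 5 is inferred from the 20 error degrees of freedom rather than stated in the query, the critical value 2.845 is the tabulated two-sided 1% point of t with 20 d.f., and the simulation settings (standard normal errors, 2000 runs, fixed seed) are arbitrary assumptions.

```python
import math
import random
import statistics

# Figures quoted in the query.
means = [26.0, 22.8, 30.0, 20.2, 24.0, 30.4]   # treatments A to F
ms_treat, ms_error, df_error = 82.11, 26.48, 20

# Assumption: 5 blocks, inferred from 20 = (6 - 1)(b - 1).
n_treat = len(means)
n_blocks = df_error // (n_treat - 1) + 1

# Consistency check: b times the variance of the means is the treatment mean square.
ms_treat_check = n_blocks * statistics.variance(means)

f_ratio = ms_treat / ms_error                      # about 3.10
se_diff = math.sqrt(2 * ms_error / n_blocks)       # standard error of a difference
t_extreme = (max(means) - min(means)) / se_diff    # about 3.13

T_CRIT_01 = 2.845  # tabulated two-sided 1% point of t with 20 d.f.


def extreme_t(rng):
    """t for largest minus smallest mean in one randomized blocks experiment
    simulated under the null hypothesis (no treatment or block effects)."""
    y = [[rng.gauss(0.0, 1.0) for _ in range(n_treat)] for _ in range(n_blocks)]
    grand = sum(map(sum, y)) / (n_treat * n_blocks)
    block_means = [sum(row) / n_treat for row in y]
    treat_means = [sum(row[j] for row in y) / n_blocks for j in range(n_treat)]
    ss_error = sum(
        (y[i][j] - block_means[i] - treat_means[j] + grand) ** 2
        for i in range(n_blocks)
        for j in range(n_treat)
    )
    ms = ss_error / df_error
    return (max(treat_means) - min(treat_means)) / math.sqrt(2 * ms / n_blocks)


def false_positive_rate(n_sim=2000, seed=1):
    """Share of null experiments in which the extreme pair exceeds the 1% point."""
    rng = random.Random(seed)
    return sum(extreme_t(rng) > T_CRIT_01 for _ in range(n_sim)) / n_sim
```

Calling `false_positive_rate()` gives a proportion well above the nominal 0.01, which is the point of the answer: a difference selected because it is the largest of fifteen possible differences does not follow the ordinary t distribution.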
ABSTRACTS 


Joint Meeting 


Institute of Mathematical Statistics 
and 
The Biometric Society, Western North American Region 


154 G. A. BAKER and R. E. BAKER. Uniformity field trials with 
strawberries. 


An empirical study of the correspondence of some uniformity trials 
of strawberry yields to the classical mathematical models shows that in 
many cases such models are not sufficiently realistic. With certain 
arrangements the agreements are better, but it appears that these set-ups, 
which involve going across rows instead of down rows, increase the cost 
and liability of errors in the records. Mere increase in size of plots does 
not guarantee improved results. 

The correlations between yields of plants in adjacent or near posi- 
tions are low. This is in contrast to the results published for wheat and 
the general assumption that plants close together yield alike. Careful 
selection of plants before setting out might help this situation. 


155 DOUGLAS G. CHAPMAN. (Department of Mathematics, 
Laboratory of Statistical Research, University of Washington.) 
Inverse and Multiple Zoological Sample Censuses. 


If a single group of tags is placed in an animal population and sampling 
is then begun and continued until a fixed number of tags are recov- 
ered, the sample size n is a random variable. Interval and point esti- 
mates of the total population size based on n are studied; further, 
optimum census designs based on this procedure are considered. Esti- 
mation procedures are considered for sample censuses where several 
groups of tags are placed (a “multiple sample census”): in this case the 
possible types of censuses are manifold: taggings cumulative or taggings 
independent, sampling procedures direct or inverse, sampling with 
replacement or without replacement. Besides the usual information 
available for estimation and testing purposes it is possible in some 
multiple sample censuses to test some of the underlying assumptions. 
Tests of this nature are examined. Finally some sequential multiple 
procedures are considered. 
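
The inverse procedure described above can be illustrated with a small simulation. The abstract does not give its estimators, so the sketch below uses a standard point estimate, n t / m, which is unbiased when animals are drawn with replacement; the names `pop_size`, `n_tagged` (t) and `n_recoveries` (m), and all numerical settings, are assumptions made for illustration.

```python
import random


def inverse_sample_size(pop_size, n_tagged, n_recoveries, rng):
    """Draw animals with replacement until n_recoveries tagged ones are seen.

    Returns the random sample size n.
    """
    p_tag = n_tagged / pop_size
    n = recovered = 0
    while recovered < n_recoveries:
        n += 1
        if rng.random() < p_tag:
            recovered += 1
    return n


def estimate_population(n, n_tagged, n_recoveries):
    """Point estimate n * t / m of the total population size."""
    return n * n_tagged / n_recoveries


def mean_estimate(pop_size=1000, n_tagged=100, n_recoveries=20, reps=4000, seed=0):
    """Average the estimate over repeated simulated censuses."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        n = inverse_sample_size(pop_size, n_tagged, n_recoveries, rng)
        total += estimate_population(n, n_tagged, n_recoveries)
    return total / reps
```

With the default settings the average returned by `mean_estimate()` lies close to the true population size of 1000, as expected for an unbiased estimate. Sampling without replacement would call for a different estimate.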
156 C. H. WADLEIGH. (U. S. Salinity Laboratory, Riverside, 
California). Multiple Regression Analysis of Soil Data. 


The degree to which various physical and chemical attributes of soil 
condition water transmissivity needed to be evaluated. Simple relations 
between permeability to water and the pH, exchangeable sodium content, 
saturation percentage, and settling volume appeared to be hyperbolic; 
and logarithmic transformations effected approaches to linearity. It was 
evident that a high degree of intercorrelation prevailed among the inde- 
pendent variables indicating the need for analyses by multiple regression. 
By taking into account gypsum content of the soil as a non-quantitative 
variable and the use of multiple curvilinear regression on other inde- 
pendent variables, 88 percent of the variance in data for log permeability 
of soil to water was accounted for. 


157 W. C. ROLLINS and C. E. HOWELL. Genetic and Environmental 
Sources of Variation in the Length of the Gestation Period 
for the Horse. 


An estimate has been made of the relative importance of certain 
genetic and environmental sources of variation in the length of the 
gestation period for horses. Involved in the analysis was an attempt to 
evaluate the importance of sex-linked genes of the foetus in conditioning 
the length of time it is carried by the mare. 


158 THEODORE M. WIDRIG. Estimating Tolerance Limits for 
the Age Composition of a Population of Fish, as Derived from 
Random Samples Within Length Strata. 


Estimation of age of fish with consideration of their corresponding 
length sometimes leads to easier age determinations than estimation 
without regard to length. Some fishes have been “aged” by consideration 
of length alone, while with others the growth rate is such that no 
reliable age-class modes appear in samples of the length frequency 
distribution of the fish population. 

A method for treating a stratified sample is given, which involves 
the variance of the product of two independent random variables. 
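
The abstract does not write out the variance it refers to. For reference, the standard identity for two independent random variables X and Y with means μ_X, μ_Y and variances σ_X², σ_Y² is given below; the symbols are introduced here for illustration, and whether the paper uses exactly this form is not stated.

```latex
\operatorname{Var}(XY) \;=\; \sigma_X^{2}\,\sigma_Y^{2} \;+\; \mu_X^{2}\,\sigma_Y^{2} \;+\; \mu_Y^{2}\,\sigma_X^{2}
```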
159 B. M. BENNETT. (University of Washington, Seattle). Estimation 
of Median Effective Dose by Moving Average Methods. 


W. R. Thompson (Bact. Rev. 1947) has proposed that moving aver- 
age methods be considered as a basic estimation procedure for LD-50 in 
bioassay problems. This paper examines the theoretical basis and the 
efficiency of moving average methods when in particular the distribution 
of tolerance or threshold may be assumed to be the integrated normal 
or the logistic. Weighted moving averages are suggested as a more 
efficient estimate for LD-50. For example, in the case of unequal numbers 
(= n_i) of animals observed at each dose level the weighted 3 term 
moving average sequence 


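$$\bar{p}_i = \frac{n_{i-1}\,p_{i-1} + n_i\,p_i + n_{i+1}\,p_{i+1}}{N_i} \qquad (i = 2, \ldots, k-1),$$ 

where $N_i = (n_{i-1} + n_i + n_{i+1})$, 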


provides a more efficient interpolation estimate of LD-50 generally. 
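
A minimal sketch of a weighted 3 term moving average followed by linear interpolation to the 50 percent point is given below. It is not taken from the paper: the weighting by group size, the pairing of each smoothed proportion with its middle dose level, and all the data are assumptions made purely for illustration.

```python
def weighted_moving_average(p, n):
    """3 term averages of the proportions p, weighted by the group sizes n.

    One value is returned for each interior dose level.
    """
    smoothed = []
    for i in range(1, len(p) - 1):
        weights = n[i - 1:i + 2]
        window = p[i - 1:i + 2]
        smoothed.append(sum(w * q for w, q in zip(weights, window)) / sum(weights))
    return smoothed


def interpolate_ld50(doses, smoothed):
    """Dose at which the smoothed response crosses 0.5, by linear interpolation."""
    points = list(zip(doses, smoothed))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= 0.5 <= y1 and y1 > y0:
            return x0 + (0.5 - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("smoothed responses do not bracket 50 percent")


# Hypothetical data: five equally spaced log doses with unequal group sizes.
log_doses = [0.0, 1.0, 2.0, 3.0, 4.0]
responses = [0.0, 0.2, 0.5, 0.9, 1.0]      # observed proportions responding
group_sizes = [10, 8, 10, 12, 10]

smoothed = weighted_moving_average(responses, group_sizes)
# Assumption: each smoothed value is plotted at the middle dose of its window.
ld50_log_dose = interpolate_ld50(log_doses[1:-1], smoothed)
```

For these made-up figures the interpolated value falls between the second and third dose levels.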


160 FRANK B. CRAMER and FREDERICK J. MOORE. (Los 
Angeles Biometrics Service, Pasadena 2, and Dept. of Experimental 
Medicine, University of Southern California, Los Angeles). 
A Graphic Means of Estimating Means and Standard Deviation. 


The adaptation of the probit technique to mensurative data has 
yielded a very rapid means of estimating means and standard deviation. 
If the probit values of a group are calculated by equilibrating the group 
to a first and second moment discrete point representation of a normal 
population, the regression of observed values upon the probit value may 
be shown to have a slope approximating the standard deviation and a 
central origin at the mean. If the variance of the observations from the 
regression line is less than 25% of the total variance, then the slope 
differs from the standard deviation by less than 4%. 

The utilization of probability paper with probit scale superimposed 
is very convenient for the graphic utilization of this method. For groups 
of 8 or larger, the probit values as calculated above will be adequately 
approximated graphically when accumulated probability is calculated as 
(Rank - 1/2)/total number in the group. 
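
A numerical counterpart of the graphic method can be sketched as follows. It assigns each ordered observation the cumulative probability (rank - 1/2)/n, converts that to a normal deviate, and fits a least-squares line of the observations on the deviates. The sketch uses normal equivalent deviates (probit minus 5), so the intercept at zero corresponds to the "central origin"; the function name and the use of `statistics.NormalDist` are choices made here, not part of the abstract.

```python
from statistics import NormalDist, mean


def graphic_mean_sd(values):
    """Estimate mean and standard deviation from a normal probability plot.

    Returns (intercept, slope) of the least-squares line of the ordered
    observations on their normal deviates.
    """
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    # Cumulative probability (rank - 1/2)/n converted to a normal deviate.
    z = [std_normal.inv_cdf((rank - 0.5) / n) for rank in range(1, n + 1)]
    z_bar, x_bar = mean(z), mean(xs)
    sxz = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, xs))
    szz = sum((zi - z_bar) ** 2 for zi in z)
    slope = sxz / szz                   # approximates the standard deviation
    intercept = x_bar - slope * z_bar   # approximates the mean
    return intercept, slope
```

For a roughly normal sample the slope should lie close to the ordinary sample standard deviation, and the intercept equals the sample mean because the deviates are symmetric about zero.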
THE BIOMETRIC SOCIETY 


Italian Meeting: The first Italian meeting of The Biometric Society was 
held on March 14th at the University of Milan with the following 
program: 


G. Barbensi: “Development of biometry in Italy” 
V. Tonolli: “Statistical studies on plankton” 
R. Scossiroli: “Some results of biometry applied to maize cultivation” 
E. Baldacci and G. Orsenigo: “Statistical investigation of the size of 
    ascospores of Endothia species in relation to the systematics of 
    this genus” 
DeBarbieri: “On the correlation between insulin hypoglycaemia 
    and convulsion rate” 
Gallo: “On a method for the estimation of the average life of 
    basophilic erythroblasts” 
Di Vita: “Lunar periodicity of epileptic fits” 
Cavalli: “Application of maximum likelihood to bacterial disinfection 
    curves” 


WNAR Meeting: The Western North American Region met jointly 
with the Institute of Mathematical Statistics at The RAND Corporation, 
Santa Monica, California, on June 15 and 16. The morning session on 
Friday the 15th was under the chairmanship of Mary Elveback and 
considered Applications of Statistics in Biology. The following papers 
were given: 


Uniformity Yield Trials with Strawberries 
    G. A. Baker and R. E. Baker 
Inverse and Multiple Zoological Sample Censuses 
    D. G. Chapman 
Multiple Regression Analysis of Soil Data 
    C. H. Wadleigh 
Genetic and Environmental Sources of Variation in Length of Gestation 
Period for Horse 
    W. C. Rollins and C. E. Howell 
Estimating Tolerance Limits for the Age Composition of a Population 
of Fish as Derived from Random Samples within Length Strata 
    T. M. Widrig 


An afternoon session under the chairmanship of Dr. Norman B. Nelson 
considered Statistics in Medical Research and Public Health. The 
following were presented: 
Sampling of Ophthalmological Data 
    F. W. Weymouth and M. J. Hirsch 
Human Heart Weight: A Study Based on 20,000 Autopsies 
    E. Bogen 
Estimation of Median Effective Dose by Weighted Moving Average 
Methods 
    B. M. Bennett 
A Graphic Estimate of Means and Standard Deviations 
    F. B. Cramer and F. J. Moore 
Law of Mass Action as Applied to Tetanus Incidence 
    J. T. Oliver 


At the annual business meeting of the Region on June 16th, the 
following regional officers were nominated: G. A. Baker, vice-president, 
W. C. Rollins, secretary-treasurer, and F. W. Weymouth and H. G. 
Romig, regional committee members for 1951-53. An official invitation 
to meet at the University of Oregon in June 1952 was accepted unani- 
mously. The attendance at the above sessions and at those arranged by 
the Institute of Mathematical Statistics numbered about 100, of which 
19 registered as members of The Biometric Society. 


Netherlands Meeting: A joint meeting of the Society with two other 
biological and statistical groups, Medisch-Biologische Sectie v./d Ver. v. 
Statistiek and Studiekring voor Proeftechniek, was held at Leiden on 
June 15 under the chairmanship of Dr. S. T. Bok with the following 
program: 
S. T. Bok: “Developments in the Netherlands in biometry” 
C. A. G. Nass: “The status of examinations for biometricians” 
N. H. Kuiper: “Graphic demonstrations” 
J. Hemelrijk: “The comparison of random samples from discrete 
    distributions” 
Th. D. J. Erlee: “From the practice of biometry” 
M. Keuls: “Analysis of variance” 
Ir. H. de Mirande: “Different measures for the reproducibility of 
    chemical analytical data” 


ENAR Meetings: Three meetings have been scheduled for later this 
year. There will be two joint sessions with the Statistics Section of the 
American Public Health Association, on methodology in chronic disease 
morbidity studies and in long-time follow-up studies, during the APHA 
annual meeting on October 29 to November 2 in San Francisco. The 
Region is planning several sessions during the AAAS meetings in Phila- 
delphia on December 27 to 29, including a symposium to be held jointly 
with Section H of the AAAS on statistical models in human population 
genetics and one or more programs of contributed papers. The Annual 
Meeting of the Region will be held in Boston, also on December 27 to 29, 
in cooperation with the Biometrics Section of the American Statistical 
Association and with the Institute of Mathematical Statistics. In addi- 
tion to contributed papers, sessions are being planned on animal experi- 
mentation, statistical problems in epidemiology, use of the range, mathe- 
matical biology, and administrative uses of statistics in medical programs. 
NOTES 
“ITITIS” 


I want to call attention to a fault that has become very common in 
the writing of scientific papers—excessive use of the indefinite pronoun 
IT. I have frequently encountered papers in which, for several consecutive 
paragraphs, about half the sentences begin “It is clear that . . .”, 
“It becomes apparent that . . .”, “It therefore seems that . . .”, “Hence 
it is impossible to . . .” and so on. Even monsters like “It is to be 
expected that it will be difficult to apply A unless it is accompanied by B, 
for which reason it is generally preferable to use C in spite of its other 
disadvantages” are sometimes met. 

This construction seems to me one of the weakest forms of expression 
in our language. It (here a definite ‘it’!) may be convenient when no 
emphasis is wanted, or in order to avoid a more clumsy turn of phrase. 
Too often, however, it is itself clumsy, but is adopted by the author who 
wants either to avoid explaining ideas that he has not fully thought out 
for himself or to insinuate his opinions without accepting responsibility 
for them. Confusion is added to clumsiness when, as in my example 
above, a sentence contains both an ‘it’ referring to a noun that is ex- 
pressed and one or more indefinite uses of the word. How much clearer, 
stronger, and in every way better to write “In spite of its disadvantages, 
C is generally preferable, because application of A without B is difficult”! 
The same idea is now stated in 17 words instead of 35, and emphasis is 
given to the most important point, the use of C. 

One reason for the modern disease of ‘ititis’ is perhaps the dislike 
of authors (and editors) for personal pronouns. This also manifests itself 
occasionally in excessive use of “There are . . .” constructions and of 
passive verbs. In moderation, none of these is objectionable, but ex- 
cesses rapidly appear in the work of a careless writer or of one who writes 
as though the reader could assimilate what he says without caring how 
he says it. In statements of fact, a simple rearrangement of a sentence 
(as illustrated above) usually permits what was done or observed to be 
said directly without an indefinite pronoun. The temptation to escape 
from personal pronouns by the “It is . . .” construction is stronger when 
opinions are stated, yet surely the need for correct attribution of re- 
sponsibility is then greater. Indeed, I suggest that an author who wants 
to express his own opinion, belief, or recommendation might with ad- 
vantage return to an older habit of scientific writing, and might fearlessly 
assert “In spite of its disadvantages, I recommend C, because application 
of A without B is difficult.” Not only would his paper then gain in 
precision, clarity, and conciseness: he would no longer permit himself 
to evade his duty of thinking out exactly what he wants to say, and of 
choosing words that convey to the reader neither more nor less than that. 


L.I.D.A.S.E., D. J. FINNEY 
91 Banbury Road, 
Oxford, 

28 June 1951. 